[2025-11-13 08:04:10,156][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch. [2025-11-13 08:04:10,988][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found). [2025-11-13 08:04:10,994][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch. [2025-11-13 08:04:12,060][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found). [2025-11-13 08:06:23,298][__main__][INFO] - Starting iteration 0. [2025-11-13 08:06:23,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:06:23,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:06:28,035][__main__][INFO] - Number of regex retries in iteration 0: 0 [2025-11-13 08:06:28,036][__main__][INFO] - agents played in iteration 0 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:06:28,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-13 08:06:28,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-13 08:06:28,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-13 08:06:28,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-13 08:06:28,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:06:28,615][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:06:29,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:06:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:06:30,296][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:06:30,624][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:06:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:06:31,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:06:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:06:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:06:32,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:06:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:06:32,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:06:33,212][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:06:33,536][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:06:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:06:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:06:34,508][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:06:34,831][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:06:35,154][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:06:35,478][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:06:35,804][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:06:36,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:06:36,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:06:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:06:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:06:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:06:37,761][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:06:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:06:38,407][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:06:38,731][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:06:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:06:39,380][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:06:39,702][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:06:40,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:06:40,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.58%, Current % of VRAM taken: 42.03%, Block Peak % of device VRAM: 25.21%, ΔTime: 00:00:11 [2025-11-13 08:06:41,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:06:41,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:06:41,494][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:06:42,629][__main__][INFO] - Iteration 1 took 19s (24.48% Gen, 69.63% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 3m 27s. Estimated total time: 16h 6m 21s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 12s, 500 more iterations: 2h 41m 3s. [2025-11-13 08:06:42,652][__main__][INFO] - Starting iteration 1. [2025-11-13 08:06:42,656][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:06:42,656][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:06:46,422][__main__][INFO] - Number of regex retries in iteration 1: 0 [2025-11-13 08:06:46,423][__main__][INFO] - agents played in iteration 1 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:06:46,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:46,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:46,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:46,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:46,993][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:06:46,993][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:06:47,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:06:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:06:48,366][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:06:48,689][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:06:49,014][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:06:49,340][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:06:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:06:49,995][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:06:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:06:50,661][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:06:50,992][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:06:51,324][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:06:51,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:06:51,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:06:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:06:52,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:06:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:06:53,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:06:53,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:06:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:06:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:06:54,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:06:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:06:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:06:55,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:06:55,895][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:06:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:06:56,551][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:06:56,877][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:06:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:06:57,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:06:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:06:58,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:06:58,908][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:06:59,637][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:06:59,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:06:59,640][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:07:00,729][__main__][INFO] - Iteration 2 took 18s (20.84% Gen, 73.13% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 0m 29s. Estimated total time: 15h 3m 41s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 7s, 500 more iterations: 2h 30m 36s. [2025-11-13 08:07:00,731][__main__][INFO] - Starting iteration 2. [2025-11-13 08:07:00,734][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:07:00,735][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:07:04,473][__main__][INFO] - Number of regex retries in iteration 2: 0 [2025-11-13 08:07:04,473][__main__][INFO] - agents played in iteration 2 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:07:04,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:04,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:05,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:05,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:05,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:07:05,083][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:07:05,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:07:06,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:07:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:07:06,769][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:07:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:07:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:07:07,742][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:07:08,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:07:08,390][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:07:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:07:09,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:07:09,363][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:07:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:07:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:07:10,338][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:07:10,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:07:10,990][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:07:11,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:07:11,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:07:11,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:07:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:07:12,610][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:07:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:07:13,260][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:07:13,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:07:13,915][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:07:14,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:07:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:07:14,891][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:07:15,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:07:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:07:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:07:16,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:07:16,914][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:07:17,656][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:07:17,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:07:17,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:07:18,777][__main__][INFO] - Iteration 3 took 18s (20.72% Gen, 73.08% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 41s. Estimated total time: 15h 2m 11s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 4s, 500 more iterations: 2h 30m 21s. [2025-11-13 08:07:18,779][__main__][INFO] - Starting iteration 3. [2025-11-13 08:07:18,783][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:07:18,783][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:07:22,595][__main__][INFO] - Number of regex retries in iteration 3: 0 [2025-11-13 08:07:22,595][__main__][INFO] - agents played in iteration 3 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:07:23,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:23,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:23,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:23,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:23,169][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:07:23,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:07:23,918][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:07:24,215][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:07:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:07:24,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:07:25,193][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:07:25,518][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:07:25,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:07:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:07:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:07:26,817][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:07:27,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:07:27,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:07:27,793][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:07:28,120][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:07:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:07:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:07:29,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:07:29,421][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:07:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:07:30,086][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:07:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:07:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:07:31,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:07:31,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:07:31,720][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:07:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:07:32,372][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:07:32,703][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:07:33,028][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:07:33,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:07:33,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:07:34,007][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:07:34,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:07:35,059][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:07:35,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:07:35,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:07:35,810][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:07:36,825][__main__][INFO] - Iteration 4 took 18s (21.13% Gen, 73.24% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 22s. Estimated total time: 15h 2m 10s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 4s, 500 more iterations: 2h 30m 21s. [2025-11-13 08:07:36,827][__main__][INFO] - Starting iteration 4. [2025-11-13 08:07:36,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:07:36,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:07:40,596][__main__][INFO] - Number of regex retries in iteration 4: 0 [2025-11-13 08:07:40,597][__main__][INFO] - agents played in iteration 4 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:07:41,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:41,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:41,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:41,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:41,177][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:07:41,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:07:41,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:07:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:07:42,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:07:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:07:43,221][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:07:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:07:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:07:44,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:07:44,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:07:44,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:07:45,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:07:45,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:07:45,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:07:46,141][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:07:46,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:07:46,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:07:47,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:07:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:07:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:07:48,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:07:48,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:07:48,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:07:49,063][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:07:49,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:07:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:07:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:07:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:07:50,693][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:07:51,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:07:51,341][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:07:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:07:51,989][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:07:52,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:07:53,045][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:07:53,818][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:07:53,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:07:53,822][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:07:54,856][__main__][INFO] - Iteration 5 took 18s (20.89% Gen, 73.37% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 57m 12s. Estimated total time: 15h 1m 18s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 2s, 500 more iterations: 2h 30m 13s. [2025-11-13 08:07:54,858][__main__][INFO] - Starting iteration 5. [2025-11-13 08:07:54,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:07:54,862][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:07:58,636][__main__][INFO] - Number of regex retries in iteration 5: 0 [2025-11-13 08:07:58,637][__main__][INFO] - agents played in iteration 5 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:07:59,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:59,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:59,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:59,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:59,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:07:59,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:07:59,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:08:00,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:08:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:08:00,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:08:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:08:01,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:08:01,885][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:08:02,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:08:02,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:08:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:08:03,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:08:03,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:08:03,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:08:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:04,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:04,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:05,154][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:08:05,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:08:06,148][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:08:06,472][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:08:06,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:08:07,123][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:08:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:08:07,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:08:08,105][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:08:08,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:08:08,750][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:08:09,074][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:08:09,402][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:08:09,734][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:08:10,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:08:10,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:08:11,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:08:11,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:08:11,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:08:11,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:08:12,873][__main__][INFO] - Iteration 6 took 18s (20.95% Gen, 73.43% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 56m 13s. Estimated total time: 15h 0m 37s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 1s, 500 more iterations: 2h 30m 6s. [2025-11-13 08:08:12,875][__main__][INFO] - Starting iteration 6. [2025-11-13 08:08:12,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:08:12,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:08:16,712][__main__][INFO] - Number of regex retries in iteration 6: 0 [2025-11-13 08:08:16,712][__main__][INFO] - agents played in iteration 6 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:08:17,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:17,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:17,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:17,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:17,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:08:17,309][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:08:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:08:18,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:08:18,674][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:08:18,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:08:19,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:08:19,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:08:19,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:08:20,298][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:08:20,625][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:08:20,950][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:08:21,275][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:08:21,599][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:08:21,924][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:08:22,248][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:22,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:23,224][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:23,549][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:08:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:08:24,200][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:08:24,525][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:08:24,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:08:25,175][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:08:25,510][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:08:25,835][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:08:26,159][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:08:26,484][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:08:26,815][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:08:27,142][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:08:27,470][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:08:27,794][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:08:28,122][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:08:28,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:08:29,189][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:08:29,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:08:29,939][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:08:29,941][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:08:31,009][__main__][INFO] - Iteration 7 took 18s (21.14% Gen, 72.96% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 1m 50s. Estimated total time: 15h 6m 32s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 13s, 500 more iterations: 2h 31m 5s. [2025-11-13 08:08:31,011][__main__][INFO] - Starting iteration 7. [2025-11-13 08:08:31,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:08:31,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:08:34,798][__main__][INFO] - Number of regex retries in iteration 7: 0 [2025-11-13 08:08:34,799][__main__][INFO] - agents played in iteration 7 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:08:35,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:35,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:35,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:35,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:35,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:08:35,379][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:08:36,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:08:36,441][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:08:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:08:37,093][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:08:37,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:08:37,748][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:08:38,076][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:08:38,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:08:38,726][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:08:39,054][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:08:39,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:08:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:08:40,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:08:40,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:40,677][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:41,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:41,664][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:08:41,988][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:08:42,313][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:08:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:08:42,965][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:08:43,293][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:08:43,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:08:43,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:08:44,274][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:08:44,600][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:08:44,924][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:08:45,249][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:08:45,576][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:08:45,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:08:46,225][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:08:46,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:08:47,289][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:08:48,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:08:48,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:08:48,059][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:08:49,277][__main__][INFO] - Iteration 8 took 18s (20.71% Gen, 72.61% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 8m 8s. Estimated total time: 15h 13m 8s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 26s, 500 more iterations: 2h 32m 11s. [2025-11-13 08:08:49,280][__main__][INFO] - Starting iteration 8. [2025-11-13 08:08:49,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:08:49,283][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:08:53,002][__main__][INFO] - Number of regex retries in iteration 8: 0 [2025-11-13 08:08:53,003][__main__][INFO] - agents played in iteration 8 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:08:53,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:53,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:53,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:53,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:53,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:08:53,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:08:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:08:54,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:08:54,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:08:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:08:55,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:08:55,920][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:08:56,248][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:08:56,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:08:56,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:08:57,224][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:08:57,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:08:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:08:58,198][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:08:58,522][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:58,847][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:59,172][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:59,499][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:09:00,151][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:09:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:09:00,799][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:09:01,124][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:09:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:09:01,774][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:09:02,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:09:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:09:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:09:03,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:09:03,401][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:09:03,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:09:04,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:09:04,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:09:04,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:09:05,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:09:06,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:09:06,221][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:09:06,223][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:09:07,239][__main__][INFO] - Iteration 9 took 17s (20.72% Gen, 73.62% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 52m 33s. Estimated total time: 14h 57m 52s. Time estimates for 10 more iterations: 2m 59s, 100 more iterations: 29m 55s, 500 more iterations: 2h 29m 38s. [2025-11-13 08:09:07,241][__main__][INFO] - Starting iteration 9. [2025-11-13 08:09:07,245][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:09:07,246][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:09:10,948][__main__][INFO] - Number of regex retries in iteration 9: 0 [2025-11-13 08:09:10,949][__main__][INFO] - agents played in iteration 9 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:09:11,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:11,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:11,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:11,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:11,539][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:09:11,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:09:12,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:09:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:09:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:09:13,248][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:09:13,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:09:13,897][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:09:14,222][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:09:14,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:09:14,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:09:15,196][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:09:15,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:09:15,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:09:16,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:09:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:09:16,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:09:17,148][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:09:17,474][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:09:17,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:09:18,123][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:09:18,448][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:09:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:09:19,099][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:09:19,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:09:19,752][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:09:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:09:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:09:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:09:21,052][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:09:21,377][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:09:21,702][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:09:22,026][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:09:22,352][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:09:22,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:09:23,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:09:24,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:09:24,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:09:24,164][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:09:25,177][__main__][INFO] - Iteration 10 took 17s (20.65% Gen, 73.69% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 51m 2s. Estimated total time: 14h 56m 38s. Time estimates for 10 more iterations: 2m 59s, 100 more iterations: 29m 53s, 500 more iterations: 2h 29m 26s. [2025-11-13 08:09:25,179][__main__][INFO] - Starting iteration 10. [2025-11-13 08:09:25,182][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-13 08:09:25,183][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:09:27,710][mllm.models.large_language_model_local][WARNING] - Response >A< did not match regex: (|), retry 1/1 [2025-11-13 08:09:29,530][__main__][INFO] - Number of regex retries in iteration 10: 1 [2025-11-13 08:09:29,531][__main__][INFO] - agents played in iteration 10 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:09:29,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:30,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:30,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:30,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:30,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:09:30,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:09:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:09:31,105][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:09:31,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:09:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:09:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:09:32,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:09:32,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:09:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:09:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:09:33,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:09:34,015][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:09:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:09:34,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:09:34,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:09:35,307][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:09:35,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:09:35,959][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:09:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:09:36,603][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:09:36,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:09:37,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:09:37,577][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:09:37,901][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:09:38,231][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:09:38,553][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:09:38,878][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:09:39,200][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:09:39,531][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:09:39,853][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:09:40,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:09:40,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:09:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:09:41,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:09:41,891][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:09:42,636][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:09:42,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:09:42,639][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:09:44,841][__main__][INFO] - Iteration 11 took 19s (22.12% Gen, 66.68% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 17m 2s. Estimated total time: 16h 22m 58s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 45s, 500 more iterations: 2h 43m 49s. [2025-11-13 08:09:44,843][__main__][INFO] - Starting iteration 11. [2025-11-13 08:09:44,846][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:09:44,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:09:49,179][__main__][INFO] - Number of regex retries in iteration 11: 0 [2025-11-13 08:09:49,180][__main__][INFO] - agents played in iteration 11 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:09:49,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:49,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:49,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:49,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:49,740][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:09:49,740][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:09:50,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:09:50,772][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:09:51,095][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:09:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:09:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:09:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:09:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:09:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:09:53,060][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:09:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:09:53,707][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:09:54,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:09:54,359][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:09:54,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:09:55,009][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:09:55,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:09:55,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:09:55,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:09:56,301][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:09:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:09:56,953][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:09:57,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:09:57,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:09:57,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:09:58,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:09:58,579][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:09:58,901][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:09:59,232][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:09:59,558][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:09:59,880][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:10:00,203][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:10:00,534][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:10:00,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:10:01,591][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:10:02,361][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:10:02,362][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:10:02,364][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:10:03,357][__main__][INFO] - Iteration 12 took 18s (23.41% Gen, 71.22% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 19m 22s. Estimated total time: 15h 25m 36s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 51s, 500 more iterations: 2h 34m 16s. [2025-11-13 08:10:03,359][__main__][INFO] - Starting iteration 12. [2025-11-13 08:10:03,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:10:03,362][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:10:07,555][__main__][INFO] - Number of regex retries in iteration 12: 0 [2025-11-13 08:10:07,555][__main__][INFO] - agents played in iteration 12 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:10:08,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:08,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:08,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:08,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:08,115][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:10:08,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:10:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:10:09,140][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:10:09,466][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:10:09,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:10:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:10:10,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:10:10,764][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:10:11,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:10:11,407][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:10:11,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:10:12,058][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:10:12,378][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:10:12,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:10:13,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:10:13,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:10:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:10:13,992][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:10:14,315][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:10:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:10:14,960][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:10:15,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:10:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:10:15,926][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:10:16,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:10:16,574][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:10:16,896][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:10:17,219][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:10:17,542][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:10:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:10:18,186][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:10:18,509][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:10:18,831][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:10:19,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:10:19,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:10:20,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:10:20,641][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:10:20,643][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:10:21,678][__main__][INFO] - Iteration 13 took 18s (22.89% Gen, 71.45% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 9m 18s. Estimated total time: 15h 15m 51s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 31s, 500 more iterations: 2h 32m 38s. [2025-11-13 08:10:21,680][__main__][INFO] - Starting iteration 13. [2025-11-13 08:10:21,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:10:21,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:10:25,875][__main__][INFO] - Number of regex retries in iteration 13: 0 [2025-11-13 08:10:25,876][__main__][INFO] - agents played in iteration 13 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:10:26,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:26,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:26,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:26,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:26,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:10:26,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:10:27,191][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:10:27,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:10:27,815][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:10:28,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:10:28,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:10:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:10:29,120][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:10:29,449][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:10:29,774][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:10:30,099][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:10:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:10:30,751][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:10:31,077][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:10:31,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:10:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:10:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:10:32,379][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:10:32,707][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:10:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:10:33,357][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:10:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:10:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:10:34,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:10:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:10:34,973][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:10:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:10:35,625][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:10:35,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:10:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:10:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:10:36,924][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:10:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:10:37,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:10:38,347][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:10:39,107][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:10:39,108][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:10:39,110][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:10:40,137][__main__][INFO] - Iteration 14 took 18s (22.71% Gen, 71.72% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 15m 52s. Estimated total time: 15h 22m 44s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 45s, 500 more iterations: 2h 33m 47s. [2025-11-13 08:10:40,139][__main__][INFO] - Starting iteration 14. [2025-11-13 08:10:40,142][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:10:40,143][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:10:44,340][__main__][INFO] - Number of regex retries in iteration 14: 0 [2025-11-13 08:10:44,340][__main__][INFO] - agents played in iteration 14 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:10:44,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:44,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:44,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:44,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:44,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:10:44,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:10:45,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:10:45,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:10:46,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:10:46,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:10:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:10:47,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:10:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:10:47,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:10:48,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:10:48,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:10:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:10:49,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:10:49,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:10:49,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:10:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:10:50,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:10:50,763][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:10:51,085][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:10:51,408][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:10:51,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:10:52,053][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:10:52,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:10:52,698][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:10:53,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:10:53,358][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:10:53,680][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:10:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:10:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:10:54,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:10:54,983][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:10:55,306][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:10:55,640][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:10:55,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:10:56,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:10:57,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:10:57,476][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:10:57,478][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:10:58,633][__main__][INFO] - Iteration 15 took 18s (22.70% Gen, 71.04% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 17m 25s. Estimated total time: 15h 24m 35s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 49s, 500 more iterations: 2h 34m 5s. [2025-11-13 08:10:58,636][__main__][INFO] - Starting iteration 15. [2025-11-13 08:10:58,639][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:10:58,640][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:02,838][__main__][INFO] - Number of regex retries in iteration 15: 0 [2025-11-13 08:11:02,839][__main__][INFO] - agents played in iteration 15 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:11:03,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:03,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:03,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:03,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:03,401][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:03,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:11:04,140][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:11:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:11:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:11:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:11:05,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:11:05,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:11:06,053][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:11:06,377][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:11:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:11:07,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:11:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:11:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:11:07,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:11:08,317][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:11:08,640][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:11:08,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:11:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:11:09,611][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:11:09,942][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:11:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:11:10,588][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:11:10,910][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:11:11,242][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:11:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:11:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:11:12,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:11:12,543][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:11:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:11:13,191][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:11:13,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:11:13,845][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:11:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:11:14,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:11:15,250][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:11:15,998][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:11:15,999][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:11:16,001][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:11:16,990][__main__][INFO] - Iteration 16 took 18s (22.87% Gen, 71.72% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 10m 6s. Estimated total time: 15h 17m 34s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 35s, 500 more iterations: 2h 32m 55s. [2025-11-13 08:11:16,992][__main__][INFO] - Starting iteration 16. [2025-11-13 08:11:16,995][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:11:16,996][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:21,234][__main__][INFO] - Number of regex retries in iteration 16: 0 [2025-11-13 08:11:21,235][__main__][INFO] - agents played in iteration 16 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:11:21,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:21,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:21,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:21,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:21,808][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:21,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:11:22,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:11:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:11:23,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:11:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:11:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:11:24,172][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:11:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:11:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:11:25,139][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:11:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:11:25,786][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:11:26,110][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:11:26,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:11:26,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:11:27,081][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:11:27,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:11:27,727][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:11:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:11:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:11:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:11:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:11:29,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:11:29,668][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:11:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:11:30,316][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:11:30,638][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:11:30,962][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:11:31,285][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:11:31,608][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:11:31,930][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:11:32,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:11:32,578][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:11:32,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:11:33,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:11:34,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:11:34,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:11:34,405][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:11:35,392][__main__][INFO] - Iteration 17 took 18s (23.04% Gen, 71.59% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 12m 6s. Estimated total time: 15h 19m 52s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 39s, 500 more iterations: 2h 33m 18s. [2025-11-13 08:11:35,394][__main__][INFO] - Starting iteration 17. [2025-11-13 08:11:35,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:11:35,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:39,790][__main__][INFO] - Number of regex retries in iteration 17: 0 [2025-11-13 08:11:39,790][__main__][INFO] - agents played in iteration 17 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:11:40,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:40,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:40,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:40,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:40,355][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:40,355][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:11:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:11:41,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:11:41,713][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:11:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:11:42,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:11:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:11:43,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:11:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:11:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:11:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:11:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:11:44,626][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:11:44,952][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:11:45,274][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:11:45,598][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:11:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:11:46,244][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:11:46,569][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:11:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:11:47,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:11:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:11:47,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:11:48,183][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:11:48,505][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:11:48,830][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:11:49,153][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:11:49,478][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:11:49,802][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:11:50,124][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:11:50,447][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:11:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:11:51,093][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:11:51,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:11:52,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:11:52,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:11:52,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:11:52,916][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:11:53,915][__main__][INFO] - Iteration 18 took 18s (23.72% Gen, 70.88% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 17m 49s. Estimated total time: 15h 25m 54s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 51s, 500 more iterations: 2h 34m 19s. [2025-11-13 08:11:53,917][__main__][INFO] - Starting iteration 18. [2025-11-13 08:11:53,920][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:11:53,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:58,189][__main__][INFO] - Number of regex retries in iteration 18: 0 [2025-11-13 08:11:58,189][__main__][INFO] - agents played in iteration 18 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:11:58,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:58,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:58,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:58,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:58,755][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:58,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:11:59,504][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:11:59,799][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:12:00,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:12:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:12:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:12:01,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:12:01,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:12:01,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:12:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:12:02,386][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:12:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:12:03,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:12:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:12:03,676][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:12:04,000][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:12:04,322][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:12:04,647][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:12:04,972][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:12:05,298][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:12:05,620][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:12:05,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:12:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:12:06,589][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:12:06,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:12:07,234][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:12:07,557][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:12:07,878][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:12:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:12:08,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:12:08,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:12:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:12:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:12:09,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:12:10,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:12:11,364][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:12:11,365][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:12:11,367][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:12:12,358][__main__][INFO] - Iteration 19 took 18s (23.15% Gen, 71.46% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 13m 34s. Estimated total time: 15h 21m 58s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 43s, 500 more iterations: 2h 33m 39s. [2025-11-13 08:12:12,361][__main__][INFO] - Starting iteration 19. [2025-11-13 08:12:12,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:12:12,365][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:12:14,590][mllm.models.large_language_model_local][WARNING] - Response %A did not match regex: (|), retry 1/1 [2025-11-13 08:12:16,659][__main__][INFO] - Number of regex retries in iteration 19: 1 [2025-11-13 08:12:16,659][__main__][INFO] - agents played in iteration 19 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:12:17,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:17,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:17,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:17,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:17,234][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:12:17,234][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:12:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:12:18,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:12:18,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:12:18,905][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:12:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:12:19,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:12:19,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:12:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:12:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:12:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:12:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:12:21,509][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:12:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:12:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:12:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:12:22,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:12:23,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:12:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:12:23,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:12:24,115][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:12:24,439][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:12:24,763][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:12:25,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:12:25,408][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:12:25,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:12:26,061][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:12:26,383][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:12:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:12:27,028][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:12:27,360][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:12:27,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:12:28,005][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:12:28,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:12:29,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:12:29,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:12:29,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:12:29,861][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:12:30,863][__main__][INFO] - Iteration 20 took 18s (23.21% Gen, 71.36% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 16m 17s. Estimated total time: 15h 24m 59s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 49s, 500 more iterations: 2h 34m 9s. [2025-11-13 08:12:30,865][__main__][INFO] - Starting iteration 20. [2025-11-13 08:12:30,869][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-13 08:12:30,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:12:35,167][__main__][INFO] - Number of regex retries in iteration 20: 0 [2025-11-13 08:12:35,169][__main__][INFO] - agents played in iteration 20 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:12:35,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:35,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:35,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:35,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:35,737][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:12:35,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:12:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:12:36,760][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:12:37,083][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:12:37,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:12:37,728][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:12:38,050][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:12:38,372][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:12:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:12:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:12:39,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:12:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:12:40,005][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:12:40,331][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:12:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:12:40,984][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:12:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:12:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:12:41,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:12:42,284][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:12:42,610][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:12:42,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:12:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:12:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:12:43,912][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:12:44,235][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:12:44,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:12:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:12:45,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:12:45,535][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:12:45,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:12:46,195][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:12:46,519][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:12:46,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:12:47,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:12:48,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:12:48,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:12:48,347][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:12:50,322][__main__][INFO] - Iteration 21 took 19s (22.11% Gen, 67.74% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 3m 41s. Estimated total time: 16h 12m 42s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 25s, 500 more iterations: 2h 42m 7s. [2025-11-13 08:12:50,324][__main__][INFO] - Starting iteration 21. [2025-11-13 08:12:50,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:12:50,328][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:12:54,896][__main__][INFO] - Number of regex retries in iteration 21: 0 [2025-11-13 08:12:54,897][__main__][INFO] - agents played in iteration 21 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:12:55,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:55,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:55,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:55,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:55,503][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:12:55,503][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:12:56,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:12:56,551][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:12:56,876][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:12:57,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:12:57,527][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:12:57,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:12:58,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:12:58,502][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:12:58,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:12:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:12:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:12:59,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:13:00,125][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:13:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:13:00,770][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:13:01,093][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:13:01,417][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:13:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:13:02,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:13:02,384][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:13:02,710][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:13:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:13:03,354][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:13:03,678][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:13:04,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:13:04,324][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:13:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:13:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:13:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:13:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:13:05,937][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:13:06,260][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:13:06,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:13:07,318][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:13:08,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:13:08,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:13:08,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:13:09,050][__main__][INFO] - Iteration 22 took 18s (24.40% Gen, 70.36% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 26m 50s. Estimated total time: 15h 36m 10s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 12s, 500 more iterations: 2h 36m 1s. [2025-11-13 08:13:09,052][__main__][INFO] - Starting iteration 22. [2025-11-13 08:13:09,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:13:09,055][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:13:13,276][__main__][INFO] - Number of regex retries in iteration 22: 0 [2025-11-13 08:13:13,276][__main__][INFO] - agents played in iteration 22 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:13:13,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:13,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:13,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:13,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:13,862][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:13:13,862][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:13:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:13:14,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:13:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:13:15,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:13:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:13:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:13:16,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:13:16,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:13:17,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:13:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:13:17,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:13:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:13:18,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:13:18,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:13:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:13:19,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:13:19,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:13:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:13:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:13:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:13:21,049][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:13:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:13:21,703][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:13:22,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:13:22,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:13:22,672][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:13:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:13:23,318][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:13:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:13:23,963][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:13:24,286][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:13:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:13:24,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:13:25,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:13:26,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:13:26,423][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:13:26,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:13:27,422][__main__][INFO] - Iteration 23 took 18s (22.98% Gen, 71.58% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 8m 45s. Estimated total time: 15h 18m 24s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 36s, 500 more iterations: 2h 33m 4s. [2025-11-13 08:13:27,424][__main__][INFO] - Starting iteration 23. [2025-11-13 08:13:27,427][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:13:27,427][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:13:31,640][__main__][INFO] - Number of regex retries in iteration 23: 0 [2025-11-13 08:13:31,640][__main__][INFO] - agents played in iteration 23 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:13:32,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:32,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:32,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:32,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:32,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:13:32,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:13:32,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:13:33,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:13:33,590][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:13:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:13:34,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:13:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:13:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:13:35,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:13:35,524][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:13:35,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:13:36,174][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:13:36,496][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:13:36,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:13:37,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:13:37,469][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:13:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:13:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:13:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:13:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:13:39,089][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:13:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:13:39,732][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:13:40,055][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:13:40,378][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:13:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:13:41,024][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:13:41,347][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:13:41,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:13:42,001][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:13:42,317][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:13:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:13:42,962][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:13:43,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:13:44,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:13:44,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:13:44,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:13:44,778][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:13:45,769][__main__][INFO] - Iteration 24 took 18s (22.97% Gen, 71.62% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 7m 12s. Estimated total time: 15h 17m 9s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 34s, 500 more iterations: 2h 32m 51s. [2025-11-13 08:13:45,772][__main__][INFO] - Starting iteration 24. [2025-11-13 08:13:45,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:13:45,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:13:50,039][__main__][INFO] - Number of regex retries in iteration 24: 0 [2025-11-13 08:13:50,039][__main__][INFO] - agents played in iteration 24 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:13:50,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:50,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:50,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:50,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:50,620][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:13:50,621][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:13:51,383][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:13:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:13:52,002][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:13:52,326][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:13:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:13:52,971][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:13:53,294][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:13:53,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:13:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:13:54,265][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:13:54,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:13:54,909][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:13:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:13:55,556][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:13:55,878][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:13:56,203][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:13:56,526][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:13:56,849][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:13:57,172][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:13:57,494][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:13:57,817][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:13:58,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:13:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:13:58,788][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:13:59,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:13:59,437][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:13:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:14:00,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:14:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:14:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:14:01,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:14:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:14:01,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:14:02,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:14:03,191][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:14:03,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:14:03,194][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:14:04,214][__main__][INFO] - Iteration 25 took 18s (23.12% Gen, 71.34% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 11m 42s. Estimated total time: 15h 21m 58s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 43s, 500 more iterations: 2h 33m 39s. [2025-11-13 08:14:04,216][__main__][INFO] - Starting iteration 25. [2025-11-13 08:14:04,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:14:04,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:14:08,525][__main__][INFO] - Number of regex retries in iteration 25: 0 [2025-11-13 08:14:08,526][__main__][INFO] - agents played in iteration 25 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:14:09,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:09,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:09,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:09,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:09,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:14:09,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:14:09,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:14:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:14:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:14:10,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:14:11,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:14:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:14:11,802][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:14:12,128][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:14:12,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:14:12,772][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:14:13,096][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:14:13,423][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:14:13,742][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:14:14,064][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:14:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:14:14,711][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:14:15,034][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:14:15,358][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:14:15,681][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:14:16,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:14:16,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:14:16,649][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:14:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:14:17,295][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:14:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:14:17,941][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:14:18,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:14:18,586][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:14:18,910][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:14:19,232][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:14:19,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:14:19,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:14:20,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:14:20,946][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:14:21,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:14:21,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:14:21,737][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:14:22,777][__main__][INFO] - Iteration 26 took 18s (23.20% Gen, 71.19% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 17m 21s. Estimated total time: 15h 27m 55s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 55s, 500 more iterations: 2h 34m 39s. [2025-11-13 08:14:22,779][__main__][INFO] - Starting iteration 26. [2025-11-13 08:14:22,782][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:14:22,783][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:14:27,106][__main__][INFO] - Number of regex retries in iteration 26: 0 [2025-11-13 08:14:27,107][__main__][INFO] - agents played in iteration 26 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:14:27,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:27,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:27,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:27,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:27,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:14:27,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:14:28,428][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:14:28,723][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:14:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:14:29,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:14:29,702][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:14:30,027][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:14:30,353][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:14:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:14:31,008][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:14:31,333][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:14:31,662][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:14:31,984][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:14:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:14:32,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:14:32,963][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:14:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:14:33,613][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:14:33,939][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:14:34,261][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:14:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:14:34,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:14:35,230][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:14:35,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:14:35,879][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:14:36,204][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:14:36,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:14:36,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:14:37,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:14:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:14:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:14:38,152][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:14:38,476][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:14:38,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:14:39,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:14:40,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:14:40,279][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:14:40,281][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:14:41,272][__main__][INFO] - Iteration 27 took 18s (23.38% Gen, 71.25% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 13m 39s. Estimated total time: 15h 24m 32s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 49s, 500 more iterations: 2h 34m 5s. [2025-11-13 08:14:41,274][__main__][INFO] - Starting iteration 27. [2025-11-13 08:14:41,278][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:14:41,278][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:14:45,506][__main__][INFO] - Number of regex retries in iteration 27: 0 [2025-11-13 08:14:45,507][__main__][INFO] - agents played in iteration 27 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:14:45,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:46,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:46,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:46,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:46,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:14:46,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:14:46,819][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:14:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:14:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:14:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:14:48,092][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:14:48,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:14:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:14:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:14:49,390][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:14:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:14:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:14:50,362][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:14:50,689][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:14:51,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:14:51,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:14:51,664][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:14:51,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:14:52,309][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:14:52,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:14:52,953][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:14:53,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:14:53,599][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:14:53,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:14:54,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:14:54,566][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:14:54,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:14:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:14:55,537][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:14:55,860][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:14:56,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:14:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:14:56,828][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:14:57,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:14:57,872][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:14:58,634][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:14:58,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:14:58,637][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:14:59,659][__main__][INFO] - Iteration 28 took 18s (23.01% Gen, 71.43% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 7m 55s. Estimated total time: 15h 19m 6s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 38s, 500 more iterations: 2h 33m 11s. [2025-11-13 08:14:59,661][__main__][INFO] - Starting iteration 28. [2025-11-13 08:14:59,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:14:59,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:15:03,887][__main__][INFO] - Number of regex retries in iteration 28: 0 [2025-11-13 08:15:03,887][__main__][INFO] - agents played in iteration 28 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:15:04,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:04,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:04,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:04,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:04,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:15:04,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:15:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:15:05,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:15:05,821][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:15:06,144][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:15:06,466][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:15:06,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:15:07,110][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:15:07,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:15:07,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:15:08,080][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:15:08,402][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:15:08,726][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:15:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:15:09,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:15:09,697][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:15:10,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:15:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:15:10,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:15:10,993][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:15:11,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:15:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:15:11,966][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:15:12,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:15:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:15:12,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:15:13,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:15:13,582][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:15:13,907][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:15:14,232][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:15:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:15:14,878][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:15:15,201][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:15:15,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:15:16,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:15:17,018][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:15:17,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:15:17,022][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:15:18,045][__main__][INFO] - Iteration 29 took 18s (22.97% Gen, 71.46% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 7m 34s. Estimated total time: 15h 19m 4s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 38s, 500 more iterations: 2h 33m 10s. [2025-11-13 08:15:18,047][__main__][INFO] - Starting iteration 29. [2025-11-13 08:15:18,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:15:18,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:15:22,297][__main__][INFO] - Number of regex retries in iteration 29: 0 [2025-11-13 08:15:22,298][__main__][INFO] - agents played in iteration 29 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:15:22,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:22,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:22,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:22,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:22,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:15:22,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:15:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:15:23,924][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:15:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:15:24,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:15:24,898][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:15:25,222][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:15:25,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:15:25,869][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:15:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:15:26,516][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:15:26,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:15:27,166][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:15:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:15:27,821][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:15:28,144][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:15:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:15:28,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:15:29,112][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:15:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:15:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:15:30,083][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:15:30,405][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:15:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:15:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:15:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:15:31,701][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:15:32,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:15:32,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:15:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:15:32,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:15:33,319][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:15:33,643][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:15:33,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:15:34,708][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:15:35,473][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:15:35,474][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:15:35,476][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:15:36,480][__main__][INFO] - Iteration 30 took 18s (23.04% Gen, 71.50% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 9m 45s. Estimated total time: 15h 21m 33s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 43s, 500 more iterations: 2h 33m 35s. [2025-11-13 08:15:36,483][__main__][INFO] - Starting iteration 30. [2025-11-13 08:15:36,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-13 08:15:36,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:15:40,665][__main__][INFO] - Number of regex retries in iteration 30: 0 [2025-11-13 08:15:40,665][__main__][INFO] - agents played in iteration 30 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:15:41,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:41,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:41,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:41,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:41,245][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:15:41,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:15:42,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:15:42,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:15:42,635][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:15:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:15:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:15:43,613][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:15:43,941][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:15:44,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:15:44,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:15:44,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:15:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:15:45,569][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:15:45,891][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:15:46,216][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:15:46,543][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:15:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:15:47,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:15:47,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:15:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:15:48,167][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:15:48,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:15:48,816][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:15:49,145][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:15:49,466][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:15:49,789][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:15:50,124][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:15:50,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:15:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:15:51,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:15:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:15:51,753][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:15:52,081][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:15:52,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:15:53,145][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:15:53,906][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:15:53,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:15:53,909][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:15:55,841][__main__][INFO] - Iteration 31 took 19s (21.59% Gen, 68.42% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 55m 37s. Estimated total time: 16h 7m 44s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 15s, 500 more iterations: 2h 41m 17s. [2025-11-13 08:15:55,843][__main__][INFO] - Starting iteration 31. [2025-11-13 08:15:55,846][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:15:55,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:16:00,870][__main__][INFO] - Number of regex retries in iteration 31: 0 [2025-11-13 08:16:00,872][__main__][INFO] - agents played in iteration 31 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:16:01,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:01,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:01,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:01,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:01,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:16:01,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:16:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:16:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:16:02,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:16:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:16:03,502][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:16:03,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:16:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:16:04,469][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:16:04,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:16:05,115][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:16:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:16:05,761][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:16:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:16:06,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:16:06,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:16:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:16:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:16:07,702][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:16:08,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:16:08,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:16:08,684][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:16:09,007][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:16:09,331][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:16:09,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:16:09,980][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:16:10,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:16:10,629][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:16:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:16:11,279][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:16:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:16:11,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:16:12,263][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:16:12,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:16:13,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:16:14,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:16:14,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:16:14,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:16:15,098][__main__][INFO] - Iteration 32 took 19s (26.10% Gen, 68.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 50m 12s. Estimated total time: 16h 2m 38s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 5s, 500 more iterations: 2h 40m 26s. [2025-11-13 08:16:15,100][__main__][INFO] - Starting iteration 32. [2025-11-13 08:16:15,103][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:16:15,103][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:16:19,746][__main__][INFO] - Number of regex retries in iteration 32: 0 [2025-11-13 08:16:19,747][__main__][INFO] - agents played in iteration 32 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:16:20,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:20,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:20,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:20,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:20,360][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:16:20,360][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:16:21,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:16:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:16:21,735][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:16:22,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:16:22,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:16:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:16:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:16:23,349][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:16:23,673][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:16:24,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:16:24,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:16:24,652][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:16:24,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:16:25,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:16:25,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:16:25,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:16:26,277][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:16:26,606][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:16:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:16:27,259][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:16:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:16:27,905][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:16:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:16:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:16:28,883][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:16:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:16:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:16:29,865][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:16:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:16:30,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:16:30,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:16:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:16:31,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:16:32,217][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:16:32,963][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:16:32,964][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:16:32,966][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:16:33,974][__main__][INFO] - Iteration 33 took 18s (24.61% Gen, 70.04% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 30m 50s. Estimated total time: 15h 43m 35s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 27s, 500 more iterations: 2h 37m 15s. [2025-11-13 08:16:33,976][__main__][INFO] - Starting iteration 33. [2025-11-13 08:16:33,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:16:33,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:16:38,621][__main__][INFO] - Number of regex retries in iteration 33: 0 [2025-11-13 08:16:38,622][__main__][INFO] - agents played in iteration 33 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:16:39,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:39,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:39,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:39,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:39,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:16:39,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:16:39,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:16:40,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:16:40,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:16:40,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:16:41,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:16:41,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:16:41,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:16:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:16:42,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:16:42,857][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:16:43,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:16:43,505][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:16:43,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:16:44,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:16:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:16:44,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:16:45,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:16:45,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:16:45,779][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:16:46,103][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:16:46,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:16:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:16:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:16:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:16:47,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:16:48,060][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:16:48,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:16:48,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:16:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:16:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:16:49,683][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:16:50,005][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:16:50,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:16:51,089][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:16:51,880][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:16:51,881][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:16:51,883][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:16:52,911][__main__][INFO] - Iteration 34 took 18s (24.52% Gen, 70.04% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 33m 37s. Estimated total time: 15h 46m 41s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 33s, 500 more iterations: 2h 37m 46s. [2025-11-13 08:16:52,914][__main__][INFO] - Starting iteration 34. [2025-11-13 08:16:52,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:16:52,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:16:57,661][__main__][INFO] - Number of regex retries in iteration 34: 0 [2025-11-13 08:16:57,661][__main__][INFO] - agents played in iteration 34 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:16:58,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:58,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:58,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:58,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:16:58,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:16:58,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:16:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:16:59,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:16:59,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:16:59,977][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:17:00,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:17:00,624][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:17:00,950][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:17:01,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:17:01,595][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:17:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:17:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:17:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:17:02,892][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:17:03,215][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:17:03,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:17:03,867][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:17:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:17:04,511][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:17:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:17:05,159][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:17:05,482][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:17:05,806][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:17:06,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:17:06,456][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:17:06,779][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:17:07,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:17:07,430][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:17:07,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:17:08,076][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:17:08,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:17:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:17:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:17:09,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:17:10,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:17:10,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:17:10,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:17:10,888][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:17:11,914][__main__][INFO] - Iteration 35 took 18s (24.98% Gen, 69.62% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 36m 30s. Estimated total time: 15h 49m 53s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 39s, 500 more iterations: 2h 38m 18s. [2025-11-13 08:17:11,917][__main__][INFO] - Starting iteration 35. [2025-11-13 08:17:11,920][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:17:11,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:17:16,627][__main__][INFO] - Number of regex retries in iteration 35: 0 [2025-11-13 08:17:16,627][__main__][INFO] - agents played in iteration 35 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:17:17,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:17,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:17,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:17,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:17,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:17:17,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:17:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:17:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:17:18,596][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:17:18,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:17:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:17:19,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:17:19,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:17:20,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:17:20,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:17:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:17:21,183][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:17:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:17:21,830][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:17:22,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:17:22,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:17:22,799][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:17:23,122][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:17:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:17:23,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:17:24,099][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:17:24,421][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:17:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:17:25,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:17:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:17:25,715][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:17:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:17:26,365][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:17:26,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:17:27,010][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:17:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:17:27,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:17:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:17:28,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:17:29,060][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:17:29,810][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:17:29,811][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:17:29,813][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:17:30,795][__main__][INFO] - Iteration 36 took 18s (24.93% Gen, 69.85% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 30m 5s. Estimated total time: 15h 43m 47s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 27s, 500 more iterations: 2h 37m 17s. [2025-11-13 08:17:30,797][__main__][INFO] - Starting iteration 36. [2025-11-13 08:17:30,800][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:17:30,801][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:17:35,555][__main__][INFO] - Number of regex retries in iteration 36: 0 [2025-11-13 08:17:35,556][__main__][INFO] - agents played in iteration 36 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:17:36,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:36,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:36,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:36,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:36,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:17:36,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:17:36,919][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:17:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:17:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:17:37,861][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:17:38,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:17:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:17:38,829][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:17:39,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:17:39,477][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:17:39,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:17:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:17:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:17:40,770][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:17:41,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:17:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:17:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:17:42,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:17:42,385][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:17:42,708][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:17:43,030][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:17:43,356][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:17:43,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:17:44,003][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:17:44,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:17:44,649][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:17:44,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:17:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:17:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:17:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:17:46,270][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:17:46,595][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:17:46,918][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:17:47,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:17:48,014][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:17:48,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:17:48,793][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:17:48,795][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:17:49,821][__main__][INFO] - Iteration 37 took 19s (25.00% Gen, 69.60% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 37m 5s. Estimated total time: 15h 51m 6s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 42s, 500 more iterations: 2h 38m 31s. [2025-11-13 08:17:49,823][__main__][INFO] - Starting iteration 37. [2025-11-13 08:17:49,826][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:17:49,827][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:17:54,583][__main__][INFO] - Number of regex retries in iteration 37: 0 [2025-11-13 08:17:54,584][__main__][INFO] - agents played in iteration 37 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:17:55,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:55,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:55,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:55,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:55,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:17:55,174][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:17:55,942][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:17:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:17:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:17:56,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:17:57,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:17:57,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:17:57,862][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:17:58,185][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:17:58,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:17:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:17:59,160][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:17:59,484][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:17:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:18:00,138][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:18:00,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:18:00,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:18:01,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:18:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:18:01,762][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:18:02,085][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:18:02,408][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:18:02,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:18:03,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:18:03,382][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:18:03,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:18:04,032][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:18:04,358][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:18:04,686][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:18:05,009][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:18:05,333][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:18:05,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:18:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:18:06,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:18:07,049][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:18:07,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:18:07,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:18:07,793][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:18:08,830][__main__][INFO] - Iteration 38 took 19s (25.03% Gen, 69.51% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 35m 53s. Estimated total time: 15h 50m 13s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 40s, 500 more iterations: 2h 38m 22s. [2025-11-13 08:18:08,832][__main__][INFO] - Starting iteration 38. [2025-11-13 08:18:08,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:18:08,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:18:13,641][__main__][INFO] - Number of regex retries in iteration 38: 0 [2025-11-13 08:18:13,642][__main__][INFO] - agents played in iteration 38 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:18:14,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:14,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:14,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:14,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:14,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:18:14,214][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:18:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:18:15,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:18:15,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:18:15,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:18:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:18:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:18:16,899][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:18:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:18:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:18:17,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:18:18,191][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:18:18,514][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:18:18,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:18:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:18:19,486][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:18:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:18:20,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:18:20,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:18:20,782][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:18:21,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:18:21,433][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:18:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:18:22,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:18:22,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:18:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:18:23,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:18:23,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:18:23,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:18:24,024][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:18:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:18:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:18:24,994][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:18:25,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:18:26,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:18:26,855][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:18:26,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:18:26,858][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:18:27,859][__main__][INFO] - Iteration 39 took 19s (25.26% Gen, 69.47% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 36m 35s. Estimated total time: 15h 51m 14s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 42s, 500 more iterations: 2h 38m 32s. [2025-11-13 08:18:27,862][__main__][INFO] - Starting iteration 39. [2025-11-13 08:18:27,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:18:27,866][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:18:32,626][__main__][INFO] - Number of regex retries in iteration 39: 0 [2025-11-13 08:18:32,626][__main__][INFO] - agents played in iteration 39 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:18:33,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:33,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:33,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:33,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:33,208][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:18:33,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:18:33,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:18:34,234][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:18:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:18:34,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:18:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:18:35,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:18:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:18:36,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:18:36,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:18:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:18:37,147][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:18:37,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:18:37,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:18:38,118][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:18:38,441][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:18:38,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:18:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:18:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:18:39,735][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:18:40,058][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:18:40,383][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:18:40,712][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:18:41,032][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:18:41,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:18:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:18:42,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:18:42,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:18:42,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:18:42,974][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:18:43,297][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:18:43,622][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:18:43,945][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:18:44,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:18:45,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:18:45,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:18:45,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:18:45,798][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:18:46,823][__main__][INFO] - Iteration 40 took 18s (25.11% Gen, 69.48% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 32m 57s. Estimated total time: 15h 47m 55s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 35s, 500 more iterations: 2h 37m 59s. [2025-11-13 08:18:46,825][__main__][INFO] - Starting iteration 40. [2025-11-13 08:18:46,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-13 08:18:46,828][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:18:51,586][__main__][INFO] - Number of regex retries in iteration 40: 0 [2025-11-13 08:18:51,586][__main__][INFO] - agents played in iteration 40 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:18:52,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:52,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:52,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:52,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:52,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:18:52,163][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:18:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:18:53,228][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:18:53,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:18:53,874][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:18:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:18:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:18:54,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:18:55,167][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:18:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:18:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:18:56,138][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:18:56,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:18:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:18:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:18:57,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:18:57,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:18:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:18:58,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:18:58,723][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:18:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:18:59,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:18:59,696][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:19:00,019][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:19:00,344][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:19:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:19:00,990][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:19:01,313][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:19:01,637][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:19:01,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:19:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:19:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:19:02,934][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:19:03,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:19:04,022][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:19:04,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:19:04,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:19:04,802][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:19:06,754][__main__][INFO] - Iteration 41 took 19s (23.88% Gen, 66.32% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 21m 3s. Estimated total time: 16h 36m 20s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 12s, 500 more iterations: 2h 46m 3s. [2025-11-13 08:19:06,756][__main__][INFO] - Starting iteration 41. [2025-11-13 08:19:06,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:19:06,760][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:19:11,973][__main__][INFO] - Number of regex retries in iteration 41: 0 [2025-11-13 08:19:11,973][__main__][INFO] - agents played in iteration 41 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:19:12,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:12,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:12,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:12,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:12,557][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:19:12,557][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:19:13,315][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:19:13,610][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:19:13,935][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:19:14,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:19:14,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:19:14,904][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:19:15,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:19:15,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:19:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:19:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:19:16,518][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:19:16,842][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:19:17,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:19:17,486][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:19:17,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:19:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:19:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:19:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:19:19,104][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:19:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:19:19,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:19:20,072][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:19:20,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:19:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:19:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:19:21,368][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:19:21,698][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:19:22,022][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:19:22,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:19:22,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:19:23,001][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:19:23,326][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:19:23,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:19:24,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:19:25,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:19:25,143][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:19:25,145][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:19:26,178][__main__][INFO] - Iteration 42 took 19s (26.84% Gen, 67.83% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 55m 19s. Estimated total time: 16h 10m 57s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 21s, 500 more iterations: 2h 41m 49s. [2025-11-13 08:19:26,180][__main__][INFO] - Starting iteration 42. [2025-11-13 08:19:26,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:19:26,183][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:19:31,360][__main__][INFO] - Number of regex retries in iteration 42: 0 [2025-11-13 08:19:31,361][__main__][INFO] - agents played in iteration 42 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:19:31,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:31,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:31,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:31,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:31,976][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:19:31,976][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:19:32,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:19:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:19:33,357][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:19:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:19:34,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:19:34,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:19:34,650][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:19:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:19:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:19:35,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:19:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:19:36,262][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:19:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:19:36,907][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:19:37,232][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:19:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:19:37,874][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:19:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:19:38,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:19:38,843][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:19:39,166][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:19:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:19:39,818][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:19:40,136][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:19:40,460][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:19:40,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:19:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:19:41,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:19:41,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:19:42,071][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:19:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:19:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:19:43,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:19:43,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:19:44,540][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:19:44,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:19:44,543][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:19:45,714][__main__][INFO] - Iteration 43 took 19s (26.51% Gen, 67.49% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 0m 38s. Estimated total time: 16h 16m 35s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 33s, 500 more iterations: 2h 42m 45s. [2025-11-13 08:19:45,716][__main__][INFO] - Starting iteration 43. [2025-11-13 08:19:45,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:19:45,721][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:19:50,846][__main__][INFO] - Number of regex retries in iteration 43: 0 [2025-11-13 08:19:50,846][__main__][INFO] - agents played in iteration 43 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:19:51,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:51,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:51,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:51,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:51,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:19:51,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:19:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:19:52,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:19:52,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:19:53,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:19:53,446][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:19:53,768][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:19:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:19:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:19:54,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:19:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:19:55,384][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:19:55,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:19:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:19:56,350][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:19:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:19:56,996][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:19:57,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:19:57,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:19:57,965][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:19:58,289][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:19:58,612][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:19:58,935][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:19:59,257][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:19:59,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:19:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:20:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:20:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:20:00,871][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:20:01,204][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:20:01,529][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:20:01,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:20:02,175][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:20:02,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:20:03,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:20:04,156][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:20:04,158][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:20:04,159][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:20:05,182][__main__][INFO] - Iteration 44 took 19s (26.33% Gen, 68.40% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 56m 52s. Estimated total time: 16h 13m 8s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 26s, 500 more iterations: 2h 42m 11s. [2025-11-13 08:20:05,184][__main__][INFO] - Starting iteration 44. [2025-11-13 08:20:05,187][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:20:05,188][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:20:10,300][__main__][INFO] - Number of regex retries in iteration 44: 0 [2025-11-13 08:20:10,301][__main__][INFO] - agents played in iteration 44 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:20:10,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:10,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:10,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:10,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:10,869][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:20:10,869][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:20:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:20:11,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:20:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:20:12,568][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:20:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:20:13,224][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:20:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:20:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:20:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:20:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:20:14,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:20:15,169][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:20:15,500][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:20:15,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:20:16,149][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:20:16,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:20:16,795][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:20:17,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:20:17,442][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:20:17,770][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:20:18,097][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:20:18,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:20:18,751][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:20:19,079][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:20:19,405][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:20:19,728][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:20:20,053][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:20:20,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:20:20,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:20:21,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:20:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:20:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:20:22,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:20:22,756][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:20:23,542][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:20:23,543][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:20:23,545][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:20:24,528][__main__][INFO] - Iteration 45 took 19s (26.43% Gen, 68.48% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 50m 29s. Estimated total time: 16h 7m 5s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 14s, 500 more iterations: 2h 41m 10s. [2025-11-13 08:20:24,530][__main__][INFO] - Starting iteration 45. [2025-11-13 08:20:24,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:20:24,535][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:20:29,519][__main__][INFO] - Number of regex retries in iteration 45: 0 [2025-11-13 08:20:29,520][__main__][INFO] - agents played in iteration 45 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:20:29,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:30,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:30,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:30,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:30,101][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:20:30,102][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:20:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:20:31,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:20:31,453][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:20:31,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:20:32,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:20:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:20:32,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:20:33,068][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:20:33,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:20:33,716][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:20:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:20:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:20:34,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:20:35,008][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:20:35,331][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:20:35,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:20:35,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:20:36,307][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:20:36,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:20:36,953][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:20:37,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:20:37,600][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:20:37,922][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:20:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:20:38,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:20:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:20:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:20:39,536][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:20:39,860][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:20:40,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:20:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:20:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:20:41,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:20:41,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:20:42,657][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:20:42,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:20:42,660][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:20:43,661][__main__][INFO] - Iteration 46 took 19s (26.06% Gen, 68.69% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 39m 29s. Estimated total time: 15h 56m 24s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 52s, 500 more iterations: 2h 39m 24s. [2025-11-13 08:20:43,663][__main__][INFO] - Starting iteration 46. [2025-11-13 08:20:43,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:20:43,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:20:48,696][__main__][INFO] - Number of regex retries in iteration 46: 0 [2025-11-13 08:20:48,697][__main__][INFO] - agents played in iteration 46 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:20:49,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:49,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:49,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:49,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:49,266][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:20:49,266][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:20:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:20:50,301][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:20:50,624][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:20:50,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:20:51,270][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:20:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:20:51,916][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:20:52,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:20:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:20:52,883][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:20:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:20:53,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:20:53,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:20:54,175][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:20:54,499][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:20:54,822][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:20:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:20:55,466][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:20:55,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:20:56,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:20:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:20:56,757][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:20:57,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:20:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:20:57,726][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:20:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:20:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:20:58,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:20:59,022][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:20:59,344][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:20:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:20:59,989][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:21:00,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:21:01,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:21:01,822][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:21:01,824][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:21:01,825][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:21:02,863][__main__][INFO] - Iteration 47 took 19s (26.21% Gen, 68.38% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 42m 40s. Estimated total time: 15h 59m 54s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 59s, 500 more iterations: 2h 39m 59s. [2025-11-13 08:21:02,865][__main__][INFO] - Starting iteration 47. [2025-11-13 08:21:02,869][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:21:02,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:07,960][__main__][INFO] - Number of regex retries in iteration 47: 0 [2025-11-13 08:21:07,960][__main__][INFO] - agents played in iteration 47 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:21:08,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:08,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:08,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:08,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:08,528][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:08,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:21:09,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:09,564][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:09,888][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:10,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:10,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:10,856][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:11,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:11,502][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:11,827][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:12,149][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:12,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:12,794][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:13,118][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:14,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:14,732][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:15,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:15,699][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:21:16,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:16,345][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:21:16,667][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:21:16,989][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:21:17,315][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:21:17,637][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:21:17,962][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:21:18,284][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:21:18,610][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:21:18,932][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:21:19,258][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:21:19,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:21:20,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:21:21,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:21:21,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:21:21,109][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:21:22,119][__main__][INFO] - Iteration 48 took 19s (26.44% Gen, 68.30% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 45m 0s. Estimated total time: 16h 2m 34s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 5s, 500 more iterations: 2h 40m 25s. [2025-11-13 08:21:22,121][__main__][INFO] - Starting iteration 48. [2025-11-13 08:21:22,125][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:21:22,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:27,263][__main__][INFO] - Number of regex retries in iteration 48: 0 [2025-11-13 08:21:27,264][__main__][INFO] - agents played in iteration 48 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:21:27,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:27,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:27,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:27,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:27,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:27,863][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:21:28,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:28,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:29,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:29,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:30,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:30,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:30,843][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:31,165][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:31,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:31,810][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:32,135][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:32,459][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:32,781][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:33,104][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:33,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:33,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:34,082][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:34,405][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:34,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:21:35,379][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:21:36,028][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:21:36,352][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:21:36,674][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:21:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:21:37,323][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:21:37,650][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:21:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:21:38,296][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:21:38,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:21:38,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:21:39,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:21:40,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:21:40,465][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:21:40,467][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:21:41,497][__main__][INFO] - Iteration 49 took 19s (26.52% Gen, 68.15% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 50m 45s. Estimated total time: 16h 8m 38s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 17s, 500 more iterations: 2h 41m 26s. [2025-11-13 08:21:41,499][__main__][INFO] - Starting iteration 49. [2025-11-13 08:21:41,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:21:41,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:46,688][__main__][INFO] - Number of regex retries in iteration 49: 0 [2025-11-13 08:21:46,688][__main__][INFO] - agents played in iteration 49 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:21:47,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:47,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:47,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:47,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:47,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:47,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:21:48,012][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:48,634][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:48,954][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:49,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:49,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:50,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:51,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:51,543][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:51,867][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:53,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:53,488][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:53,810][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:54,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:21:54,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:21:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:21:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:21:56,099][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:21:56,421][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:21:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:21:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:21:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:21:57,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:21:58,053][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:21:58,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:21:59,138][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:21:59,902][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:21:59,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:21:59,905][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:00,931][__main__][INFO] - Iteration 50 took 19s (26.69% Gen, 68.03% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 53m 14s. Estimated total time: 16h 11m 27s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 22s, 500 more iterations: 2h 41m 54s. [2025-11-13 08:22:00,933][__main__][INFO] - Starting iteration 50. [2025-11-13 08:22:00,936][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2025-11-13 08:22:00,936][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:06,054][__main__][INFO] - Number of regex retries in iteration 50: 0 [2025-11-13 08:22:06,054][__main__][INFO] - agents played in iteration 50 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:22:06,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:06,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:06,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:06,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:06,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:06,634][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:22:07,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:08,023][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:08,346][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:08,670][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:10,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:11,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:11,585][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:12,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:13,205][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:13,528][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:22:14,181][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:14,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:15,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:15,473][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:15,796][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:16,452][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:17,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:17,420][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:17,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:22:18,521][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:19,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:19,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:19,291][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:21,285][__main__][INFO] - Iteration 51 took 20s (25.15% Gen, 65.05% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 38m 57s. Estimated total time: 16h 57m 30s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 55s, 500 more iterations: 2h 49m 35s. [2025-11-13 08:22:21,287][__main__][INFO] - Starting iteration 51. [2025-11-13 08:22:21,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:22:21,291][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:26,802][__main__][INFO] - Number of regex retries in iteration 51: 0 [2025-11-13 08:22:26,802][__main__][INFO] - agents played in iteration 51 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:22:27,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:27,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:27,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:27,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:27,392][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:27,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:22:28,164][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:28,460][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:28,785][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:29,108][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:29,758][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:30,404][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:31,059][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:31,381][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:31,704][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:32,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:32,683][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:33,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:33,329][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:33,659][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:33,982][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:34,629][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:22:34,960][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:35,609][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:35,932][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:36,261][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:36,907][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:37,230][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:37,560][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:37,883][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:38,206][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:38,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:22:39,297][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:40,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:40,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:40,058][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:41,052][__main__][INFO] - Iteration 52 took 19s (27.89% Gen, 67.08% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 9m 13s. Estimated total time: 16h 28m 5s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 56s, 500 more iterations: 2h 44m 40s. [2025-11-13 08:22:41,054][__main__][INFO] - Starting iteration 52. [2025-11-13 08:22:41,057][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:22:41,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:46,517][__main__][INFO] - Number of regex retries in iteration 52: 0 [2025-11-13 08:22:46,518][__main__][INFO] - agents played in iteration 52 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:22:46,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:47,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:47,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:47,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:47,086][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:47,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:22:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:48,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:49,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:49,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:49,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:50,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:51,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:51,359][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:51,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:52,329][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:52,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:52,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:53,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:22:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:54,917][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:55,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:55,571][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:56,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:56,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:56,869][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:57,832][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:58,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:22:58,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:59,711][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:59,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:59,713][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:23:00,731][__main__][INFO] - Iteration 53 took 19s (27.75% Gen, 67.07% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 4m 31s. Estimated total time: 16h 23m 43s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 47s, 500 more iterations: 2h 43m 57s. [2025-11-13 08:23:00,733][__main__][INFO] - Starting iteration 53. [2025-11-13 08:23:00,735][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:23:00,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:23:06,209][__main__][INFO] - Number of regex retries in iteration 53: 0 [2025-11-13 08:23:06,209][__main__][INFO] - agents played in iteration 53 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:23:06,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:06,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:06,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:06,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:06,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:23:06,777][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:23:07,626][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:23:07,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:23:08,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:23:08,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:23:08,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:23:09,223][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:23:09,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:23:09,869][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:23:10,192][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:23:10,518][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:23:10,843][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:23:11,170][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:23:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:23:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:23:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:23:12,472][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:23:12,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:23:13,123][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:23:13,450][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:23:13,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:23:14,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:23:14,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:23:14,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:23:15,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:23:15,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:23:15,726][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:23:16,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:23:16,373][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:23:16,696][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:23:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:23:17,341][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:23:17,663][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:23:17,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:23:18,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:23:19,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:23:19,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:23:19,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:23:20,546][__main__][INFO] - Iteration 54 took 19s (27.63% Gen, 67.25% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 2s. Estimated total time: 16h 30m 34s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 1s, 500 more iterations: 2h 45m 5s. [2025-11-13 08:23:20,548][__main__][INFO] - Starting iteration 54. [2025-11-13 08:23:20,552][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:23:20,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:23:25,979][__main__][INFO] - Number of regex retries in iteration 54: 0 [2025-11-13 08:23:25,980][__main__][INFO] - agents played in iteration 54 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:23:26,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:26,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:26,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:26,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:26,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:23:26,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:23:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:23:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:23:27,911][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:23:28,236][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:23:28,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:23:28,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:23:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:23:29,534][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:23:29,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:23:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:23:30,508][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:23:30,833][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:23:31,157][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:23:31,480][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:23:31,804][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:23:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:23:32,452][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:23:32,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:23:33,100][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:23:33,429][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:23:33,753][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:23:34,087][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:23:34,403][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:23:34,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:23:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:23:35,384][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:23:35,704][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:23:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:23:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:23:36,686][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:23:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:23:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:23:37,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:23:38,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:23:39,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:23:39,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:23:39,155][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:23:40,194][__main__][INFO] - Iteration 55 took 19s (27.63% Gen, 67.07% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 2m 17s. Estimated total time: 16h 22m 8s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 44s, 500 more iterations: 2h 43m 41s. [2025-11-13 08:23:40,196][__main__][INFO] - Starting iteration 55. [2025-11-13 08:23:40,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:23:40,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:23:45,612][__main__][INFO] - Number of regex retries in iteration 55: 0 [2025-11-13 08:23:45,613][__main__][INFO] - agents played in iteration 55 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:23:46,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:46,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:46,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:46,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:46,188][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:23:46,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:23:46,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:23:47,229][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:23:47,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:23:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:23:48,204][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:23:48,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:23:48,852][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:23:49,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:23:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:23:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:23:50,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:23:50,470][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:23:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:23:51,117][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:23:51,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:23:51,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:23:52,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:23:52,411][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:23:52,735][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:23:53,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:23:53,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:23:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:23:54,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:23:54,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:23:54,681][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:23:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:23:55,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:23:55,652][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:23:55,976][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:23:56,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:23:56,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:23:56,950][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:23:57,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:23:58,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:23:58,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:23:58,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:23:58,835][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:23:59,880][__main__][INFO] - Iteration 56 took 19s (27.50% Gen, 67.18% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 3m 54s. Estimated total time: 16h 24m 5s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 48s, 500 more iterations: 2h 44m 0s. [2025-11-13 08:23:59,882][__main__][INFO] - Starting iteration 56. [2025-11-13 08:23:59,886][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:23:59,886][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:24:05,296][__main__][INFO] - Number of regex retries in iteration 56: 0 [2025-11-13 08:24:05,296][__main__][INFO] - agents played in iteration 56 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:24:05,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:05,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:05,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:05,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:05,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:24:05,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:24:06,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:24:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:24:07,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:24:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:24:07,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:24:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:24:08,576][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:24:08,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:24:09,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:24:09,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:24:09,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:24:10,204][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:24:10,530][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:24:10,856][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:24:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:24:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:24:11,838][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:24:12,164][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:24:12,493][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:24:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:24:13,149][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:24:13,472][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:24:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:24:14,125][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:24:14,448][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:24:14,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:24:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:24:15,425][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:24:15,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:24:16,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:24:16,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:24:16,727][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:24:17,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:24:17,812][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:24:18,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:24:18,570][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:24:18,572][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:24:19,633][__main__][INFO] - Iteration 57 took 19s (27.40% Gen, 67.22% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 6m 53s. Estimated total time: 16h 27m 24s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 54s, 500 more iterations: 2h 44m 34s. [2025-11-13 08:24:19,635][__main__][INFO] - Starting iteration 57. [2025-11-13 08:24:19,638][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:24:19,639][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:24:25,040][__main__][INFO] - Number of regex retries in iteration 57: 0 [2025-11-13 08:24:25,041][__main__][INFO] - agents played in iteration 57 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:24:25,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:25,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:25,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:25,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:25,611][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:24:25,611][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:24:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:24:26,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:24:26,986][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:24:27,310][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:24:27,633][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:24:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:24:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:24:28,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:24:28,946][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:24:29,268][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:24:29,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:24:29,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:24:30,249][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:24:30,576][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:24:30,902][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:24:31,232][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:24:31,560][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:24:31,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:24:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:24:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:24:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:24:33,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:24:33,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:24:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:24:34,177][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:24:34,505][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:24:34,828][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:24:35,154][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:24:35,478][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:24:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:24:36,127][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:24:36,453][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:24:36,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:24:37,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:24:38,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:24:38,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:24:38,284][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:24:39,351][__main__][INFO] - Iteration 58 took 19s (27.40% Gen, 67.18% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 4m 48s. Estimated total time: 16h 25m 39s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 51s, 500 more iterations: 2h 44m 16s. [2025-11-13 08:24:39,353][__main__][INFO] - Starting iteration 58. [2025-11-13 08:24:39,356][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:24:39,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:24:44,767][__main__][INFO] - Number of regex retries in iteration 58: 0 [2025-11-13 08:24:44,767][__main__][INFO] - agents played in iteration 58 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:24:45,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:45,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:45,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:45,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:24:45,351][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:24:45,352][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:24:46,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:24:46,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:24:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:24:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:24:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:24:47,679][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:24:48,005][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:24:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:24:48,651][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:24:48,976][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:24:49,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:24:49,626][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:24:49,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:24:50,277][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:24:50,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:24:50,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:24:51,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:24:51,584][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:24:51,911][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:24:52,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:24:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:24:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:24:53,210][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:24:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:24:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:24:54,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:24:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:24:54,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:24:55,156][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:24:55,480][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:24:55,809][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:24:56,132][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:24:56,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:24:57,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:24:57,988][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:24:57,990][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:24:57,991][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:24:58,978][__main__][INFO] - Iteration 59 took 19s (27.57% Gen, 67.39% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 59m 57s. Estimated total time: 16h 21m 8s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 42s, 500 more iterations: 2h 43m 31s. [2025-11-13 08:24:58,980][__main__][INFO] - Starting iteration 59. [2025-11-13 08:24:58,984][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:24:58,985][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:25:04,431][__main__][INFO] - Number of regex retries in iteration 59: 0 [2025-11-13 08:25:04,432][__main__][INFO] - agents played in iteration 59 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:25:04,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:04,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:04,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:05,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:05,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:25:05,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:25:05,767][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:25:06,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:25:06,385][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:25:06,708][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:25:07,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:25:07,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:25:07,677][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:25:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:25:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:25:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:25:08,969][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:25:09,294][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:25:09,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:25:09,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:25:10,266][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:25:10,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:25:10,913][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:25:11,236][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:25:11,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:25:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:25:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:25:12,534][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:25:12,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:25:13,178][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:25:13,499][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:25:13,822][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:25:14,145][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:25:14,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:25:14,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:25:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:25:15,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:25:15,767][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:25:16,093][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:25:16,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:25:17,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:25:17,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:25:17,600][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:25:18,577][__main__][INFO] - Iteration 60 took 19s (27.80% Gen, 67.21% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 58m 13s. Estimated total time: 16h 19m 43s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 39s, 500 more iterations: 2h 43m 17s. [2025-11-13 08:25:18,579][__main__][INFO] - Starting iteration 60. [2025-11-13 08:25:18,582][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2025-11-13 08:25:18,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:25:24,080][__main__][INFO] - Number of regex retries in iteration 60: 0 [2025-11-13 08:25:24,081][__main__][INFO] - agents played in iteration 60 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:25:24,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:24,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:24,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:24,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:24,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:25:24,664][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:25:25,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:25:25,727][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:25:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:25:26,380][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:25:26,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:25:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:25:27,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:25:27,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:25:28,018][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:25:28,348][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:25:28,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:25:28,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:25:29,319][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:25:29,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:25:29,967][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:25:30,290][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:25:30,613][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:25:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:25:31,262][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:25:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:25:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:25:32,240][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:25:32,561][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:25:32,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:25:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:25:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:25:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:25:34,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:25:34,504][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:25:34,831][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:25:35,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:25:35,476][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:25:35,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:25:36,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:25:37,294][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:25:37,295][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:25:37,297][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:25:39,232][__main__][INFO] - Iteration 61 took 20s (26.62% Gen, 64.00% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 50m 42s. Estimated total time: 17h 12m 32s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 25s, 500 more iterations: 2h 52m 5s. [2025-11-13 08:25:39,235][__main__][INFO] - Starting iteration 61. [2025-11-13 08:25:39,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:25:39,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:25:45,105][__main__][INFO] - Number of regex retries in iteration 61: 0 [2025-11-13 08:25:45,106][__main__][INFO] - agents played in iteration 61 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:25:45,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:45,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:45,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:45,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:45,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:25:45,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:25:46,467][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:25:46,762][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:25:47,094][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:25:47,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:25:47,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:25:48,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:25:48,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:25:48,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:25:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:25:49,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:25:49,692][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:25:50,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:25:50,339][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:25:50,663][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:25:50,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:25:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:25:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:25:51,961][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:25:52,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:25:52,611][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:25:52,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:25:53,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:25:53,584][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:25:53,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:25:54,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:25:54,554][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:25:54,878][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:25:55,202][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:25:55,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:25:55,848][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:25:56,171][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:25:56,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:25:56,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:25:57,585][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:25:58,322][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:25:58,324][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:25:58,326][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:25:59,419][__main__][INFO] - Iteration 62 took 20s (29.07% Gen, 65.50% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 26m 56s. Estimated total time: 16h 49m 6s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 38s, 500 more iterations: 2h 48m 11s. [2025-11-13 08:25:59,421][__main__][INFO] - Starting iteration 62. [2025-11-13 08:25:59,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:25:59,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:26:05,151][__main__][INFO] - Number of regex retries in iteration 62: 0 [2025-11-13 08:26:05,152][__main__][INFO] - agents played in iteration 62 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:26:05,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,734][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:26:05,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:26:06,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:26:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:26:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:26:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:26:07,757][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:26:08,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:26:08,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:26:08,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:26:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:26:09,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:26:09,708][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:26:10,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:26:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:26:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:26:11,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:26:11,327][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:26:11,650][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:26:11,974][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:26:12,297][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:26:12,620][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:26:12,945][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:26:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:26:13,592][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:26:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:26:14,240][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:26:14,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:26:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:26:15,210][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:26:15,534][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:26:15,858][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:26:16,182][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:26:16,504][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:26:16,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:26:17,602][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:26:18,361][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:26:18,362][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:26:18,364][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:26:19,366][__main__][INFO] - Iteration 63 took 19s (28.72% Gen, 66.25% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 14m 36s. Estimated total time: 16h 37m 7s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 14s, 500 more iterations: 2h 46m 11s. [2025-11-13 08:26:19,368][__main__][INFO] - Starting iteration 63. [2025-11-13 08:26:19,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:26:19,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:26:25,137][__main__][INFO] - Number of regex retries in iteration 63: 0 [2025-11-13 08:26:25,138][__main__][INFO] - agents played in iteration 63 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:26:25,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:25,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:25,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:25,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:25,710][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:26:25,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:26:26,473][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:26:26,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:26:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:26:27,414][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:26:27,738][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:26:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:26:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:26:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:26:29,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:26:29,355][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:26:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:26:30,003][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:26:30,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:26:30,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:26:30,974][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:26:31,298][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:26:31,624][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:26:31,947][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:26:32,271][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:26:32,593][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:26:32,918][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:26:33,241][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:26:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:26:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:26:34,214][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:26:34,538][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:26:34,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:26:35,184][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:26:35,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:26:35,830][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:26:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:26:36,479][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:26:36,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:26:37,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:26:38,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:26:38,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:26:38,356][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:26:39,352][__main__][INFO] - Iteration 64 took 19s (28.86% Gen, 66.15% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 16m 13s. Estimated total time: 16h 39m 3s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 18s, 500 more iterations: 2h 46m 30s. [2025-11-13 08:26:39,355][__main__][INFO] - Starting iteration 64. [2025-11-13 08:26:39,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:26:39,359][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:26:45,129][__main__][INFO] - Number of regex retries in iteration 64: 0 [2025-11-13 08:26:45,129][__main__][INFO] - agents played in iteration 64 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:26:45,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:45,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:45,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:45,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:45,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:26:45,699][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:26:46,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:26:46,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:26:47,065][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:26:47,387][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:26:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:26:48,048][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:26:48,371][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:26:48,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:26:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:26:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:26:49,668][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:26:49,992][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:26:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:26:50,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:26:50,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:26:51,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:26:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:26:51,930][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:26:52,253][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:26:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:26:52,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:26:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:26:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:26:53,872][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:26:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:26:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:26:54,843][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:26:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:26:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:26:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:26:56,142][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:26:56,466][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:26:56,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:26:57,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:26:58,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:26:58,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:26:58,331][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:26:59,342][__main__][INFO] - Iteration 65 took 19s (28.87% Gen, 66.06% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 16m 3s. Estimated total time: 16h 39m 14s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 18s, 500 more iterations: 2h 46m 32s. [2025-11-13 08:26:59,345][__main__][INFO] - Starting iteration 65. [2025-11-13 08:26:59,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:26:59,348][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:27:05,110][__main__][INFO] - Number of regex retries in iteration 65: 0 [2025-11-13 08:27:05,111][__main__][INFO] - agents played in iteration 65 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:27:05,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:05,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:05,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:05,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:05,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:27:05,690][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:27:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:27:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:27:07,071][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:27:07,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:27:07,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:27:08,053][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:27:08,383][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:27:08,711][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:27:09,038][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:27:09,363][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:27:09,692][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:27:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:27:10,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:27:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:27:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:27:11,330][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:27:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:27:11,982][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:27:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:27:12,641][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:27:12,970][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:27:13,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:27:13,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:27:13,946][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:27:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:27:14,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:27:14,921][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:27:15,247][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:27:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:27:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:27:16,223][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:27:16,548][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:27:16,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:27:17,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:27:18,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:27:18,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:27:18,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:27:19,403][__main__][INFO] - Iteration 66 took 20s (28.73% Gen, 66.17% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 19m 18s. Estimated total time: 16h 42m 49s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 25s, 500 more iterations: 2h 47m 8s. [2025-11-13 08:27:19,406][__main__][INFO] - Starting iteration 66. [2025-11-13 08:27:19,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:27:19,410][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:27:25,110][__main__][INFO] - Number of regex retries in iteration 66: 0 [2025-11-13 08:27:25,111][__main__][INFO] - agents played in iteration 66 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:27:25,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:25,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:25,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:25,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:25,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:27:25,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:27:26,457][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:27:26,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:27:27,077][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:27:27,403][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:27:27,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:27:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:27:28,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:27:28,703][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:27:29,028][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:27:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:27:29,677][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:27:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:27:30,327][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:27:30,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:27:30,975][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:27:31,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:27:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:27:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:27:32,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:27:32,599][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:27:32,925][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:27:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:27:33,574][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:27:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:27:34,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:27:34,549][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:27:34,872][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:27:35,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:27:35,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:27:35,859][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:27:36,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:27:36,516][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:27:36,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:27:37,614][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:27:38,367][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:27:38,369][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:27:38,370][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:27:39,400][__main__][INFO] - Iteration 67 took 19s (28.52% Gen, 66.32% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 15m 45s. Estimated total time: 16h 39m 36s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 19s, 500 more iterations: 2h 46m 36s. [2025-11-13 08:27:39,403][__main__][INFO] - Starting iteration 67. [2025-11-13 08:27:39,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:27:39,406][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:27:45,016][__main__][INFO] - Number of regex retries in iteration 67: 0 [2025-11-13 08:27:45,017][__main__][INFO] - agents played in iteration 67 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:27:45,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:45,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:45,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:45,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:27:45,592][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:27:45,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:27:46,332][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:27:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:27:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:27:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:27:47,601][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:27:47,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:27:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:27:48,575][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:27:48,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:27:49,221][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:27:49,548][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:27:49,872][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:27:50,196][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:27:50,522][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:27:50,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:27:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:27:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:27:51,821][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:27:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:27:52,474][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:27:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:27:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:27:53,449][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:27:53,776][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:27:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:27:54,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:27:54,752][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:27:55,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:27:55,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:27:55,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:27:56,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:27:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:27:56,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:27:57,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:27:58,229][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:27:58,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:27:58,233][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:27:59,286][__main__][INFO] - Iteration 68 took 19s (28.22% Gen, 66.47% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 9m 52s. Estimated total time: 16h 34m 2s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 8s, 500 more iterations: 2h 45m 40s. [2025-11-13 08:27:59,289][__main__][INFO] - Starting iteration 68. [2025-11-13 08:27:59,292][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:27:59,293][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:28:05,126][__main__][INFO] - Number of regex retries in iteration 68: 0 [2025-11-13 08:28:05,127][__main__][INFO] - agents played in iteration 68 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:28:05,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:05,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:05,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:05,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:05,702][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:28:05,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:28:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:28:06,737][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:28:07,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:28:07,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:28:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:28:08,038][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:28:08,361][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:28:08,684][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:28:09,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:28:09,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:28:09,667][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:28:09,991][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:28:10,314][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:28:10,639][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:28:10,962][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:28:11,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:28:11,609][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:28:11,945][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:28:12,272][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:28:12,594][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:28:12,920][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:28:13,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:28:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:28:13,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:28:14,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:28:14,553][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:28:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:28:15,200][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:28:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:28:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:28:16,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:28:16,499][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:28:16,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:28:17,570][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:28:18,332][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:28:18,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:28:18,336][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:28:19,347][__main__][INFO] - Iteration 69 took 20s (29.09% Gen, 65.86% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 18m 16s. Estimated total time: 16h 42m 46s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 25s, 500 more iterations: 2h 47m 7s. [2025-11-13 08:28:19,348][__main__][INFO] - Starting iteration 69. [2025-11-13 08:28:19,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:28:19,353][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:28:25,068][__main__][INFO] - Number of regex retries in iteration 69: 0 [2025-11-13 08:28:25,069][__main__][INFO] - agents played in iteration 69 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:28:25,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:25,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:25,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:25,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:25,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:28:25,620][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:28:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:28:26,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:28:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:28:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:28:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:28:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:28:28,283][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:28:28,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:28:28,941][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:28:29,265][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:28:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:28:29,912][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:28:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:28:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:28:30,892][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:28:31,216][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:28:31,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:28:31,871][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:28:32,193][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:28:32,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:28:32,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:28:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:28:33,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:28:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:28:34,149][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:28:34,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:28:34,797][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:28:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:28:35,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:28:35,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:28:36,106][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:28:36,434][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:28:36,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:28:37,503][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:28:38,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:28:38,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:28:38,265][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:28:39,282][__main__][INFO] - Iteration 70 took 19s (28.68% Gen, 66.21% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 41s. Estimated total time: 16h 36m 31s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 13s, 500 more iterations: 2h 46m 5s. [2025-11-13 08:28:39,284][__main__][INFO] - Starting iteration 70. [2025-11-13 08:28:39,286][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2025-11-13 08:28:39,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:28:44,864][__main__][INFO] - Number of regex retries in iteration 70: 0 [2025-11-13 08:28:44,865][__main__][INFO] - agents played in iteration 70 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:28:45,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:45,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:45,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:45,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:45,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:28:45,436][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:28:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:28:46,494][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:28:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:28:47,153][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:28:47,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:28:47,801][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:28:48,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:28:48,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:28:48,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:28:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:28:49,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:28:49,741][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:28:50,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:28:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:28:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:28:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:28:51,366][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:28:51,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:28:52,012][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:28:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:28:52,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:28:52,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:28:53,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:28:53,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:28:53,955][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:28:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:28:54,604][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:28:54,926][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:28:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:28:55,575][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:28:55,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:28:56,225][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:28:56,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:28:57,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:28:58,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:28:58,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:28:58,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:00,081][__main__][INFO] - Iteration 71 took 20s (26.83% Gen, 63.58% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 54m 33s. Estimated total time: 17h 19m 44s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 39s, 500 more iterations: 2h 53m 17s. [2025-11-13 08:29:00,083][__main__][INFO] - Starting iteration 71. [2025-11-13 08:29:00,086][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:29:00,086][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:06,298][__main__][INFO] - Number of regex retries in iteration 71: 0 [2025-11-13 08:29:06,298][__main__][INFO] - agents played in iteration 71 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:29:06,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:06,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:29:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:07,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:08,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:08,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:09,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:09,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:10,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:10,510][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:10,833][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:11,806][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:12,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:12,777][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:13,101][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:13,751][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:14,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:29:14,401][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:14,724][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:15,049][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:15,372][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:15,696][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:16,349][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:16,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:17,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:17,645][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:17,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:29:18,723][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:19,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:19,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:19,505][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:20,500][__main__][INFO] - Iteration 72 took 20s (30.43% Gen, 64.69% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 35m 14s. Estimated total time: 17h 0m 45s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 1s, 500 more iterations: 2h 50m 7s. [2025-11-13 08:29:20,503][__main__][INFO] - Starting iteration 72. [2025-11-13 08:29:20,506][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:29:20,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:26,471][__main__][INFO] - Number of regex retries in iteration 72: 0 [2025-11-13 08:29:26,472][__main__][INFO] - agents played in iteration 72 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:29:26,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:27,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:27,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:27,053][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:27,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:29:27,813][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:28,108][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:29,080][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:29,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:29,728][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:30,377][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:31,024][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:31,676][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:32,971][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:33,295][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:33,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:29:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:34,919][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:35,243][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:36,220][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:36,869][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:37,196][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:38,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:29:38,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:39,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:39,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:39,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:40,761][__main__][INFO] - Iteration 73 took 20s (29.45% Gen, 65.50% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 26m 55s. Estimated total time: 16h 52m 47s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 45s, 500 more iterations: 2h 48m 47s. [2025-11-13 08:29:40,763][__main__][INFO] - Starting iteration 73. [2025-11-13 08:29:40,766][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:29:40,767][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:46,761][__main__][INFO] - Number of regex retries in iteration 73: 0 [2025-11-13 08:29:46,762][__main__][INFO] - agents played in iteration 73 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:29:47,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:47,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:47,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:47,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:47,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:47,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:29:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:48,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:48,722][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:49,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:49,369][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:50,344][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:50,668][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:50,991][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:51,315][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:52,610][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:52,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:53,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:53,905][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:54,229][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:29:54,875][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:55,198][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:56,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:56,490][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:56,815][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:57,140][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:57,463][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:57,786][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:58,110][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:58,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:29:59,200][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:59,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:59,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:59,974][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:30:01,024][__main__][INFO] - Iteration 74 took 20s (29.59% Gen, 65.22% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 26m 44s. Estimated total time: 16h 52m 56s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 45s, 500 more iterations: 2h 48m 49s. [2025-11-13 08:30:01,026][__main__][INFO] - Starting iteration 74. [2025-11-13 08:30:01,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:30:01,030][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:07,007][__main__][INFO] - Number of regex retries in iteration 74: 0 [2025-11-13 08:30:07,008][__main__][INFO] - agents played in iteration 74 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:30:07,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:07,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:07,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:07,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:07,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:07,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:30:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:30:08,655][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:30:08,983][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:30:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:30:09,630][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:30:09,954][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:30:10,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:30:10,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:30:10,932][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:30:11,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:30:11,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:30:11,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:30:12,243][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:30:12,567][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:30:12,893][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:30:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:30:13,541][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:30:13,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:30:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:30:14,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:30:14,841][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:30:15,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:30:15,487][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:30:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:30:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:30:16,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:30:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:30:17,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:30:17,432][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:30:17,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:30:18,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:30:18,404][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:30:18,728][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:30:19,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:30:20,250][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:30:20,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:30:20,253][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:30:21,264][__main__][INFO] - Iteration 75 took 20s (29.54% Gen, 65.45% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 25m 14s. Estimated total time: 16h 51m 47s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 43s, 500 more iterations: 2h 48m 37s. [2025-11-13 08:30:21,267][__main__][INFO] - Starting iteration 75. [2025-11-13 08:30:21,270][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:30:21,270][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:27,337][__main__][INFO] - Number of regex retries in iteration 75: 0 [2025-11-13 08:30:27,337][__main__][INFO] - agents played in iteration 75 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:30:27,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:27,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:30:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:30:28,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:30:29,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:30:29,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:30:29,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:30:30,256][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:30:30,580][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:30:30,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:30:31,230][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:30:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:30:31,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:30:32,200][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:30:32,524][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:30:32,848][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:30:33,172][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:30:33,498][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:30:33,820][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:30:34,144][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:30:34,466][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:30:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:30:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:30:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:30:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:30:36,093][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:30:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:30:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:30:37,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:30:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:30:37,709][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:30:38,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:30:38,357][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:30:38,688][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:30:39,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:30:39,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:30:40,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:30:40,538][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:30:40,539][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:30:41,541][__main__][INFO] - Iteration 76 took 20s (29.92% Gen, 65.12% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 26m 45s. Estimated total time: 16h 53m 38s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 47s, 500 more iterations: 2h 48m 56s. [2025-11-13 08:30:41,544][__main__][INFO] - Starting iteration 76. [2025-11-13 08:30:41,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:30:41,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:47,536][__main__][INFO] - Number of regex retries in iteration 76: 0 [2025-11-13 08:30:47,537][__main__][INFO] - agents played in iteration 76 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:30:47,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:48,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:48,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:48,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:48,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:48,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:30:48,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:30:49,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:30:49,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:30:49,807][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:30:50,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:30:50,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:30:50,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:30:51,112][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:30:51,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:30:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:30:52,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:30:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:30:52,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:30:53,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:30:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:30:53,727][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:30:54,056][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:30:54,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:30:54,706][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:30:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:30:55,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:30:55,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:30:56,011][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:30:56,334][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:30:56,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:30:56,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:30:57,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:30:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:30:57,961][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:30:58,286][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:30:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:30:58,934][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:30:59,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:31:00,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:31:00,762][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:31:00,764][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:31:00,766][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:31:01,781][__main__][INFO] - Iteration 77 took 20s (29.60% Gen, 65.38% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 24m 32s. Estimated total time: 16h 51m 45s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 43s, 500 more iterations: 2h 48m 37s. [2025-11-13 08:31:01,783][__main__][INFO] - Starting iteration 77. [2025-11-13 08:31:01,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:31:01,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:31:07,737][__main__][INFO] - Number of regex retries in iteration 77: 0 [2025-11-13 08:31:07,737][__main__][INFO] - agents played in iteration 77 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:31:08,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:08,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:08,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:08,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:08,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:31:08,302][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:31:09,083][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:31:09,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:31:09,705][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:31:10,028][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:31:10,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:31:10,688][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:31:11,012][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:31:11,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:31:11,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:31:11,985][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:31:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:31:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:31:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:31:13,294][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:31:13,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:31:13,948][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:31:14,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:31:14,594][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:31:14,918][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:31:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:31:15,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:31:15,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:31:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:31:16,546][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:31:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:31:17,198][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:31:17,524][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:31:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:31:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:31:18,507][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:31:18,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:31:19,161][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:31:19,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:31:20,248][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:31:21,020][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:31:21,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:31:21,024][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:31:22,038][__main__][INFO] - Iteration 78 took 20s (29.38% Gen, 65.60% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 25m 2s. Estimated total time: 16h 52m 35s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 45s, 500 more iterations: 2h 48m 45s. [2025-11-13 08:31:22,040][__main__][INFO] - Starting iteration 78. [2025-11-13 08:31:22,043][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:31:22,043][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:31:28,001][__main__][INFO] - Number of regex retries in iteration 78: 0 [2025-11-13 08:31:28,002][__main__][INFO] - agents played in iteration 78 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:31:28,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:28,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:28,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:28,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:28,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:31:28,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:31:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:31:29,633][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:31:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:31:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:31:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:31:30,938][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:31:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:31:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:31:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:31:32,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:31:32,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:31:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:31:33,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:31:33,538][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:31:33,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:31:34,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:31:34,516][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:31:34,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:31:35,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:31:35,492][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:31:35,816][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:31:36,141][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:31:36,467][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:31:36,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:31:37,114][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:31:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:31:37,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:31:38,095][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:31:38,412][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:31:38,735][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:31:39,060][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:31:39,391][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:31:39,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:31:40,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:31:41,241][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:31:41,243][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:31:41,244][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:31:42,274][__main__][INFO] - Iteration 79 took 20s (29.45% Gen, 65.45% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 23m 42s. Estimated total time: 16h 51m 36s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 43s, 500 more iterations: 2h 48m 36s. [2025-11-13 08:31:42,276][__main__][INFO] - Starting iteration 79. [2025-11-13 08:31:42,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:31:42,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:31:48,202][__main__][INFO] - Number of regex retries in iteration 79: 0 [2025-11-13 08:31:48,202][__main__][INFO] - agents played in iteration 79 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:31:48,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:48,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:48,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:48,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:48,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:31:48,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:31:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:31:49,828][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:31:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:31:50,477][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:31:50,803][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:31:51,126][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:31:51,450][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:31:51,773][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:31:52,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:31:52,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:31:52,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:31:53,068][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:31:53,392][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:31:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:31:54,039][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:31:54,362][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:31:54,686][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:31:55,009][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:31:55,332][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:31:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:31:55,989][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:31:56,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:31:56,641][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:31:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:31:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:31:57,616][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:31:57,936][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:31:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:31:58,588][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:31:58,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:31:59,238][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:31:59,562][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:31:59,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:32:00,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:32:01,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:32:01,418][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:32:01,420][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:32:02,605][__main__][INFO] - Iteration 80 took 20s (29.14% Gen, 65.02% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 28m 8s. Estimated total time: 16h 56m 22s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 52s, 500 more iterations: 2h 49m 23s. [2025-11-13 08:32:02,607][__main__][INFO] - Starting iteration 80. [2025-11-13 08:32:02,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2025-11-13 08:32:02,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:32:08,516][__main__][INFO] - Number of regex retries in iteration 80: 0 [2025-11-13 08:32:08,517][__main__][INFO] - agents played in iteration 80 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:32:08,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:09,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:09,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:09,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:09,088][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:32:09,089][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:32:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:32:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:32:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:32:10,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:32:11,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:32:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:32:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:32:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:32:12,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:32:12,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:32:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:32:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:32:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:32:14,008][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:32:14,333][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:32:14,656][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:32:14,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:32:15,303][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:32:15,626][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:32:15,949][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:32:16,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:32:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:32:16,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:32:17,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:32:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:32:17,890][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:32:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:32:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:32:18,859][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:32:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:32:19,505][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:32:19,828][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:32:20,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:32:20,906][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:32:21,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:32:21,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:32:21,657][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:32:23,618][__main__][INFO] - Iteration 81 took 21s (28.11% Gen, 62.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 17h 1m 53s. Estimated total time: 17h 30m 28s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 0s, 500 more iterations: 2h 55m 4s. [2025-11-13 08:32:23,621][__main__][INFO] - Starting iteration 81. [2025-11-13 08:32:23,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:32:23,624][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:32:29,879][__main__][INFO] - Number of regex retries in iteration 81: 0 [2025-11-13 08:32:29,880][__main__][INFO] - agents played in iteration 81 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:32:30,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:30,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:30,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:30,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:30,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:32:30,458][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:32:31,218][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:32:31,514][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:32:31,841][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:32:32,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:32:32,493][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:32:32,816][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:32:33,140][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:32:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:32:33,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:32:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:32:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:32:34,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:32:35,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:32:35,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:32:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:32:36,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:32:36,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:32:36,715][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:32:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:32:37,368][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:32:37,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:32:38,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:32:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:32:38,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:32:39,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:32:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:32:39,647][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:32:39,977][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:32:40,302][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:32:40,627][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:32:40,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:32:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:32:41,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:32:42,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:32:43,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:32:43,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:32:43,115][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:32:44,109][__main__][INFO] - Iteration 82 took 20s (30.53% Gen, 64.61% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 35m 25s. Estimated total time: 17h 4m 20s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 8s, 500 more iterations: 2h 50m 43s. [2025-11-13 08:32:44,112][__main__][INFO] - Starting iteration 82. [2025-11-13 08:32:44,114][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:32:44,115][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:32:50,314][__main__][INFO] - Number of regex retries in iteration 82: 0 [2025-11-13 08:32:50,315][__main__][INFO] - agents played in iteration 82 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:32:50,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:50,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:50,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:50,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:50,872][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:32:50,872][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:32:51,614][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:32:51,908][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:32:52,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:32:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:32:52,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:32:53,207][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:32:53,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:32:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:32:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:32:54,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:32:54,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:32:55,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:32:55,475][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:32:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:32:56,121][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:32:56,446][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:32:56,772][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:32:57,095][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:32:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:32:57,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:32:58,066][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:32:58,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:32:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:32:59,038][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:32:59,361][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:32:59,688][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:33:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:33:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:33:00,657][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:33:00,982][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:33:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:33:01,629][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:33:01,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:33:02,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:33:03,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:33:03,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:33:03,434][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:33:04,473][__main__][INFO] - Iteration 83 took 20s (30.46% Gen, 64.43% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 28m 43s. Estimated total time: 16h 57m 59s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 55s, 500 more iterations: 2h 49m 39s. [2025-11-13 08:33:04,475][__main__][INFO] - Starting iteration 83. [2025-11-13 08:33:04,478][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:33:04,479][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:33:10,805][__main__][INFO] - Number of regex retries in iteration 83: 0 [2025-11-13 08:33:10,806][__main__][INFO] - agents played in iteration 83 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:33:11,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:11,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:11,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:11,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:11,375][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:33:11,375][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:33:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:33:12,424][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:33:12,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:33:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:33:13,401][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:33:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:33:14,050][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:33:14,376][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:33:14,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:33:15,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:33:15,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:33:15,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:33:16,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:33:16,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:33:16,656][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:33:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:33:17,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:33:17,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:33:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:33:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:33:18,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:33:18,940][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:33:19,269][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:33:19,602][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:33:19,926][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:33:20,251][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:33:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:33:20,905][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:33:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:33:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:33:21,879][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:33:22,203][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:33:22,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:33:23,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:33:24,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:33:24,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:33:24,016][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:33:24,992][__main__][INFO] - Iteration 84 took 20s (30.84% Gen, 64.39% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 36m 7s. Estimated total time: 17h 5m 43s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 11s, 500 more iterations: 2h 50m 57s. [2025-11-13 08:33:24,994][__main__][INFO] - Starting iteration 84. [2025-11-13 08:33:24,997][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:33:24,998][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:33:31,200][__main__][INFO] - Number of regex retries in iteration 84: 0 [2025-11-13 08:33:31,200][__main__][INFO] - agents played in iteration 84 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:33:31,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:31,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:31,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:31,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:31,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:33:31,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:33:32,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:33:32,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:33:33,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:33:33,492][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:33:33,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:33:34,149][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:33:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:33:34,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:33:35,128][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:33:35,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:33:35,784][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:33:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:33:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:33:36,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:33:37,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:33:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:33:37,734][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:33:38,058][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:33:38,382][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:33:38,704][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:33:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:33:39,353][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:33:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:33:40,005][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:33:40,332][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:33:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:33:40,980][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:33:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:33:41,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:33:41,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:33:42,274][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:33:42,601][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:33:42,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:33:43,665][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:33:44,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:33:44,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:33:44,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:33:45,526][__main__][INFO] - Iteration 85 took 20s (30.21% Gen, 64.41% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 36m 32s. Estimated total time: 17h 6m 29s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 12s, 500 more iterations: 2h 51m 4s. [2025-11-13 08:33:45,528][__main__][INFO] - Starting iteration 85. [2025-11-13 08:33:45,531][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:33:45,531][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:33:51,780][__main__][INFO] - Number of regex retries in iteration 85: 0 [2025-11-13 08:33:51,781][__main__][INFO] - agents played in iteration 85 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:33:52,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:52,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:52,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:52,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:52,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:33:52,373][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:33:53,127][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:33:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:33:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:33:54,068][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:33:54,394][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:33:54,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:33:55,043][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:33:55,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:33:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:33:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:33:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:33:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:33:56,992][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:33:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:33:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:33:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:33:58,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:33:58,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:33:58,939][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:33:59,263][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:33:59,588][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:33:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:34:00,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:34:00,560][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:34:00,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:34:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:34:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:34:01,857][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:34:02,181][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:34:02,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:34:02,829][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:34:03,152][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:34:03,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:34:04,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:34:04,990][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:04,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:04,993][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:05,984][__main__][INFO] - Iteration 86 took 20s (30.55% Gen, 64.60% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 32m 23s. Estimated total time: 17h 2m 40s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 5s, 500 more iterations: 2h 50m 26s. [2025-11-13 08:34:05,986][__main__][INFO] - Starting iteration 86. [2025-11-13 08:34:05,989][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:34:05,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:34:12,224][__main__][INFO] - Number of regex retries in iteration 86: 0 [2025-11-13 08:34:12,225][__main__][INFO] - agents played in iteration 86 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:34:12,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:12,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:12,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:12,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:12,811][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:34:12,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:34:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:34:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:34:14,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:34:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:34:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:34:15,183][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:34:15,512][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:34:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:34:16,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:34:16,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:34:16,811][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:34:17,137][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:34:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:34:17,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:34:18,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:34:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:34:18,761][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:34:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:34:19,417][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:34:19,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:34:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:34:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:34:20,713][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:34:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:34:21,361][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:34:21,685][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:34:22,012][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:34:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:34:22,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:34:22,987][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:34:23,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:34:23,638][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:34:23,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:34:24,733][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:34:25,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:25,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:25,493][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:26,509][__main__][INFO] - Iteration 87 took 20s (30.38% Gen, 64.66% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 35m 26s. Estimated total time: 17h 6m 4s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 12s, 500 more iterations: 2h 51m 0s. [2025-11-13 08:34:26,511][__main__][INFO] - Starting iteration 87. [2025-11-13 08:34:26,514][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:34:26,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:34:32,806][__main__][INFO] - Number of regex retries in iteration 87: 0 [2025-11-13 08:34:32,806][__main__][INFO] - agents played in iteration 87 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:34:33,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:33,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:33,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:33,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:33,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:34:33,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:34:34,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:34:34,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:34:34,741][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:34:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:34:35,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:34:35,726][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:34:36,053][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:34:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:34:36,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:34:37,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:34:37,355][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:34:37,679][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:34:38,013][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:34:38,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:34:38,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:34:38,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:34:39,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:34:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:34:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:34:40,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:34:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:34:40,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:34:41,269][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:34:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:34:41,923][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:34:42,247][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:34:42,572][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:34:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:34:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:34:43,548][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:34:43,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:34:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:34:44,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:34:45,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:34:46,053][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:46,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:46,056][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:47,085][__main__][INFO] - Iteration 88 took 20s (30.58% Gen, 64.41% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 37m 36s. Estimated total time: 17h 8m 34s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 17s, 500 more iterations: 2h 51m 25s. [2025-11-13 08:34:47,087][__main__][INFO] - Starting iteration 88. [2025-11-13 08:34:47,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:34:47,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:34:53,422][__main__][INFO] - Number of regex retries in iteration 88: 0 [2025-11-13 08:34:53,422][__main__][INFO] - agents played in iteration 88 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:34:53,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:53,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:53,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:53,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:53,996][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:34:53,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:34:54,753][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:34:55,048][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:34:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:34:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:34:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:34:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:34:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:34:56,993][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:34:57,317][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:34:57,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:34:57,966][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:34:58,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:34:58,615][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:34:58,938][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:34:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:34:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:34:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:35:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:35:00,559][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:35:00,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:35:01,208][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:35:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:35:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:35:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:35:02,512][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:35:02,837][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:35:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:35:03,488][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:35:03,812][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:35:04,136][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:35:04,460][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:35:04,785][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:35:05,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:35:05,881][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:35:06,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:35:06,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:35:06,674][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:35:07,691][__main__][INFO] - Iteration 89 took 20s (30.73% Gen, 64.33% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 38m 45s. Estimated total time: 17h 10m 3s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 20s, 500 more iterations: 2h 51m 40s. [2025-11-13 08:35:07,693][__main__][INFO] - Starting iteration 89. [2025-11-13 08:35:07,696][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:35:07,697][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:35:14,004][__main__][INFO] - Number of regex retries in iteration 89: 0 [2025-11-13 08:35:14,005][__main__][INFO] - agents played in iteration 89 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:35:14,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:14,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:14,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:14,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:14,569][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:35:14,569][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:35:15,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:35:15,607][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:35:15,932][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:35:16,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:35:16,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:35:16,902][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:35:17,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:35:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:35:17,873][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:35:18,196][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:35:18,528][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:35:18,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:35:19,169][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:35:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:35:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:35:20,142][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:35:20,466][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:35:20,789][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:35:21,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:35:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:35:21,762][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:35:22,086][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:35:22,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:35:22,733][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:35:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:35:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:35:23,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:35:24,030][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:35:24,354][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:35:24,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:35:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:35:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:35:25,649][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:35:26,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:35:27,178][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:35:27,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:35:27,181][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:35:28,173][__main__][INFO] - Iteration 90 took 20s (30.80% Gen, 64.35% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 32m 13s. Estimated total time: 17h 3m 52s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 7s, 500 more iterations: 2h 50m 38s. [2025-11-13 08:35:28,175][__main__][INFO] - Starting iteration 90. [2025-11-13 08:35:28,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-13 08:35:28,179][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:35:34,379][__main__][INFO] - Number of regex retries in iteration 90: 0 [2025-11-13 08:35:34,380][__main__][INFO] - agents played in iteration 90 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:35:34,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:34,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:34,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:34,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:34,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:35:34,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:35:35,731][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:35:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:35:36,350][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:35:36,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:35:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:35:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:35:37,653][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:35:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:35:38,303][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:35:38,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:35:38,950][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:35:39,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:35:39,599][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:35:39,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:35:40,249][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:35:40,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:35:40,896][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:35:41,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:35:41,544][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:35:41,868][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:35:42,191][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:35:42,515][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:35:42,840][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:35:43,163][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:35:43,489][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:35:43,813][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:35:44,136][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:35:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:35:44,785][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:35:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:35:45,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:35:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:35:46,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:35:46,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:35:47,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:35:47,582][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:35:47,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:35:49,566][__main__][INFO] - Iteration 91 took 21s (28.99% Gen, 61.73% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 17m 24s. Estimated total time: 17h 49m 25s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 38s, 500 more iterations: 2h 58m 14s. [2025-11-13 08:35:49,568][__main__][INFO] - Starting iteration 91. [2025-11-13 08:35:49,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:35:49,572][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:35:56,321][__main__][INFO] - Number of regex retries in iteration 91: 0 [2025-11-13 08:35:56,321][__main__][INFO] - agents played in iteration 91 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:35:56,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:56,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:56,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:56,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:56,904][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:35:56,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:35:57,679][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:35:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:35:58,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:35:58,625][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:35:58,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:35:59,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:35:59,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:35:59,924][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:36:00,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:36:00,575][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:36:00,901][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:36:01,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:36:01,555][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:36:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:36:02,203][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:36:02,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:36:02,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:36:03,184][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:36:03,507][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:36:03,834][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:36:04,162][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:36:04,488][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:36:04,811][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:36:05,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:36:05,459][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:36:05,782][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:36:06,107][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:36:06,434][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:36:06,759][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:36:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:36:07,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:36:07,733][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:36:08,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:36:08,832][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:36:09,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:36:09,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:36:09,585][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:36:10,592][__main__][INFO] - Iteration 92 took 21s (32.11% Gen, 63.10% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 58m 43s. Estimated total time: 17h 31m 4s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 2s, 500 more iterations: 2h 55m 10s. [2025-11-13 08:36:10,594][__main__][INFO] - Starting iteration 92. [2025-11-13 08:36:10,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:36:10,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:36:17,156][__main__][INFO] - Number of regex retries in iteration 92: 0 [2025-11-13 08:36:17,157][__main__][INFO] - agents played in iteration 92 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:36:17,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:17,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:17,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:17,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:17,738][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:36:17,738][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:36:18,508][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:36:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:36:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:36:19,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:36:19,779][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:36:20,104][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:36:20,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:36:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:36:21,075][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:36:21,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:36:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:36:22,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:36:22,371][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:36:22,695][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:36:23,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:36:23,351][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:36:23,668][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:36:23,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:36:24,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:36:24,643][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:36:24,967][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:36:25,293][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:36:25,617][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:36:25,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:36:26,266][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:36:26,592][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:36:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:36:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:36:27,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:36:27,889][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:36:28,213][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:36:28,537][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:36:28,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:36:29,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:36:30,429][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:36:30,430][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:36:30,432][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:36:31,517][__main__][INFO] - Iteration 93 took 20s (31.35% Gen, 63.46% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 53m 18s. Estimated total time: 17h 26m 0s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 52s, 500 more iterations: 2h 54m 20s. [2025-11-13 08:36:31,519][__main__][INFO] - Starting iteration 93. [2025-11-13 08:36:31,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:36:31,523][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:36:38,058][__main__][INFO] - Number of regex retries in iteration 93: 0 [2025-11-13 08:36:38,059][__main__][INFO] - agents played in iteration 93 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:36:38,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:38,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:38,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:38,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:38,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:36:38,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:36:39,374][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:36:39,669][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:36:39,995][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:36:40,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:36:40,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:36:40,966][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:36:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:36:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:36:41,937][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:36:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:36:42,585][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:36:42,911][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:36:43,235][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:36:43,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:36:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:36:44,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:36:44,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:36:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:36:45,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:36:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:36:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:36:46,167][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:36:46,491][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:36:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:36:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:36:47,465][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:36:47,791][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:36:48,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:36:48,438][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:36:48,763][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:36:49,088][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:36:49,411][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:36:49,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:36:50,498][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:36:51,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:36:51,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:36:51,294][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:36:52,322][__main__][INFO] - Iteration 94 took 20s (31.42% Gen, 63.63% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 46m 56s. Estimated total time: 17h 20m 0s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 40s, 500 more iterations: 2h 53m 20s. [2025-11-13 08:36:52,324][__main__][INFO] - Starting iteration 94. [2025-11-13 08:36:52,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:36:52,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:36:58,903][__main__][INFO] - Number of regex retries in iteration 94: 0 [2025-11-13 08:36:58,904][__main__][INFO] - agents played in iteration 94 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:36:59,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:59,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:59,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:59,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:59,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:36:59,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:37:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:37:00,524][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:37:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:37:01,175][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:37:01,497][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:37:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:37:02,146][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:37:02,471][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:37:02,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:37:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:37:03,444][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:37:03,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:37:04,091][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:37:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:37:04,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:37:05,062][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:37:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:37:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:37:06,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:37:06,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:37:06,687][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:37:07,011][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:37:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:37:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:37:07,983][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:37:08,307][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:37:08,630][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:37:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:37:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:37:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:37:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:37:10,257][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:37:10,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:37:11,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:37:12,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:37:12,143][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:37:12,145][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:37:13,197][__main__][INFO] - Iteration 95 took 20s (31.50% Gen, 63.45% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 50m 4s. Estimated total time: 17h 23m 29s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 46s, 500 more iterations: 2h 53m 54s. [2025-11-13 08:37:13,199][__main__][INFO] - Starting iteration 95. [2025-11-13 08:37:13,202][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:37:13,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:37:19,726][__main__][INFO] - Number of regex retries in iteration 95: 0 [2025-11-13 08:37:19,727][__main__][INFO] - agents played in iteration 95 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:37:20,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:20,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:20,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:20,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:20,303][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:37:20,304][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:37:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:37:21,385][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:37:21,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:37:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:37:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:37:22,687][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:37:23,012][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:37:23,337][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:37:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:37:23,985][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:37:24,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:37:24,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:37:24,960][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:37:25,284][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:37:25,608][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:37:25,945][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:37:26,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:37:26,601][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:37:26,925][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:37:27,255][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:37:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:37:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:37:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:37:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:37:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:37:29,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:37:29,524][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:37:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:37:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:37:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:37:30,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:37:31,158][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:37:31,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:37:32,275][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:37:33,033][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:37:33,035][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:37:33,037][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:37:34,057][__main__][INFO] - Iteration 96 took 20s (31.28% Gen, 63.82% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 49m 1s. Estimated total time: 17h 22m 46s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 45s, 500 more iterations: 2h 53m 47s. [2025-11-13 08:37:34,058][__main__][INFO] - Starting iteration 96. [2025-11-13 08:37:34,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:37:34,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:37:40,674][__main__][INFO] - Number of regex retries in iteration 96: 0 [2025-11-13 08:37:40,675][__main__][INFO] - agents played in iteration 96 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:37:41,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:41,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:41,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:41,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:41,263][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:37:41,264][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:37:42,001][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:37:42,296][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:37:42,624][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:37:42,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:37:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:37:43,595][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:37:43,920][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:37:44,244][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:37:44,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:37:44,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:37:45,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:37:45,546][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:37:45,870][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:37:46,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:37:46,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:37:46,847][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:37:47,173][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:37:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:37:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:37:48,145][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:37:48,472][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:37:48,794][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:37:49,118][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:37:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:37:49,771][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:37:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:37:50,420][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:37:50,744][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:37:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:37:51,394][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:37:51,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:37:52,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:37:52,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:37:53,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:37:53,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:37:53,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:37:53,900][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:37:55,065][__main__][INFO] - Iteration 97 took 21s (31.48% Gen, 62.96% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 56m 8s. Estimated total time: 17h 30m 14s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 0s, 500 more iterations: 2h 55m 2s. [2025-11-13 08:37:55,067][__main__][INFO] - Starting iteration 97. [2025-11-13 08:37:55,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:37:55,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:38:01,539][__main__][INFO] - Number of regex retries in iteration 97: 0 [2025-11-13 08:38:01,539][__main__][INFO] - agents played in iteration 97 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:38:02,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:02,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:02,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:02,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:02,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:38:02,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:38:02,924][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:38:03,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:38:03,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:38:03,867][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:38:04,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:38:04,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:38:04,840][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:38:05,165][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:38:05,489][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:38:05,812][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:38:06,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:38:06,460][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:38:06,785][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:38:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:38:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:38:07,760][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:38:08,084][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:38:08,408][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:38:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:38:09,064][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:38:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:38:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:38:10,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:38:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:38:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:38:11,009][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:38:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:38:11,661][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:38:11,985][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:38:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:38:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:38:12,962][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:38:13,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:38:14,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:38:14,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:38:14,780][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:38:14,781][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:38:15,805][__main__][INFO] - Iteration 98 took 20s (31.19% Gen, 63.86% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 42m 20s. Estimated total time: 17h 16m 47s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 33s, 500 more iterations: 2h 52m 47s. [2025-11-13 08:38:15,807][__main__][INFO] - Starting iteration 98. [2025-11-13 08:38:15,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:38:15,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:38:22,177][__main__][INFO] - Number of regex retries in iteration 98: 0 [2025-11-13 08:38:22,178][__main__][INFO] - agents played in iteration 98 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:38:22,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:22,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:22,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:22,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:22,742][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:38:22,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:38:23,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:38:23,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:38:24,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:38:24,428][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:38:24,757][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:38:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:38:25,404][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:38:25,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:38:26,058][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:38:26,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:38:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:38:27,028][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:38:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:38:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:38:28,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:38:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:38:28,656][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:38:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:38:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:38:29,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:38:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:38:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:38:30,615][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:38:30,939][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:38:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:38:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:38:31,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:38:32,239][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:38:32,563][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:38:32,886][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:38:33,211][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:38:33,535][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:38:33,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:38:34,618][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:38:35,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:38:35,369][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:38:35,371][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:38:36,381][__main__][INFO] - Iteration 99 took 20s (30.95% Gen, 64.13% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 33m 47s. Estimated total time: 17h 8m 34s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 17s, 500 more iterations: 2h 51m 25s. [2025-11-13 08:38:36,383][__main__][INFO] - Starting iteration 99. [2025-11-13 08:38:36,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:38:36,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:38:42,873][__main__][INFO] - Number of regex retries in iteration 99: 0 [2025-11-13 08:38:42,874][__main__][INFO] - agents played in iteration 99 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:38:43,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:43,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:43,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:43,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:43,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:38:43,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:38:44,207][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:38:44,501][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:38:44,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:38:45,150][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:38:45,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:38:45,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:38:46,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:38:46,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:38:46,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:38:47,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:38:47,416][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:38:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:38:48,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:38:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:38:48,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:38:49,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:38:49,368][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:38:49,691][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:38:50,017][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:38:50,343][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:38:50,669][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:38:50,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:38:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:38:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:38:51,973][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:38:52,296][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:38:52,620][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:38:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:38:53,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:38:53,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:38:53,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:38:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:38:54,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:38:55,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:38:56,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:38:56,121][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:38:56,123][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:38:57,131][__main__][INFO] - Iteration 100 took 20s (31.27% Gen, 63.87% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 42m 10s. Estimated total time: 17h 17m 18s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 34s, 500 more iterations: 2h 52m 53s. [2025-11-13 08:38:57,134][__main__][INFO] - Starting iteration 100. [2025-11-13 08:38:57,137][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. [2025-11-13 08:38:57,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:39:03,654][__main__][INFO] - Number of regex retries in iteration 100: 0 [2025-11-13 08:39:03,654][__main__][INFO] - agents played in iteration 100 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:39:04,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:04,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:04,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:04,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:04,236][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:39:04,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:39:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:39:05,297][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:39:05,619][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:39:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:39:06,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:39:06,593][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:39:06,916][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:39:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:39:07,564][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:39:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:39:08,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:39:08,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:39:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:39:09,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:39:09,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:39:09,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:39:10,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:39:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:39:10,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:39:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:39:11,441][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:39:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:39:12,089][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:39:12,413][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:39:12,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:39:13,060][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:39:13,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:39:13,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:39:14,034][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:39:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:39:14,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:39:15,005][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:39:15,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:39:16,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:39:16,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:39:16,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:39:16,861][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:39:18,804][__main__][INFO] - Iteration 101 took 21s (30.07% Gen, 60.95% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 27m 53s. Estimated total time: 18h 3m 23s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 6s, 500 more iterations: 3h 0m 33s. [2025-11-13 08:39:18,806][__main__][INFO] - Starting iteration 101. [2025-11-13 08:39:18,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:39:18,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:39:25,808][__main__][INFO] - Number of regex retries in iteration 101: 0 [2025-11-13 08:39:25,809][__main__][INFO] - agents played in iteration 101 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:39:26,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:26,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:26,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:26,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:26,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:39:26,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:39:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:39:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:39:27,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:39:28,085][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:39:28,411][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:39:28,737][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:39:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:39:29,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:39:29,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:39:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:39:30,363][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:39:30,687][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:39:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:39:31,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:39:31,663][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:39:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:39:32,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:39:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:39:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:39:33,303][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:39:33,626][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:39:33,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:39:34,273][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:39:34,597][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:39:34,923][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:39:35,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:39:35,571][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:39:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:39:36,231][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:39:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:39:36,880][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:39:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:39:37,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:39:38,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:39:39,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:39:39,075][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:39:39,076][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:39:40,090][__main__][INFO] - Iteration 102 took 21s (32.88% Gen, 62.34% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 8m 13s. Estimated total time: 17h 44m 4s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 28s, 500 more iterations: 2h 57m 20s. [2025-11-13 08:39:40,092][__main__][INFO] - Starting iteration 102. [2025-11-13 08:39:40,095][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:39:40,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:39:46,895][__main__][INFO] - Number of regex retries in iteration 102: 0 [2025-11-13 08:39:46,895][__main__][INFO] - agents played in iteration 102 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:39:47,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:47,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:47,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:47,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:47,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:39:47,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:39:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:39:48,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:39:48,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:39:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:39:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:39:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:39:50,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:39:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:39:50,819][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:39:51,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:39:51,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:39:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:39:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:39:52,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:39:52,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:39:53,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:39:53,414][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:39:53,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:39:54,061][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:39:54,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:39:54,711][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:39:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:39:55,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:39:55,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:39:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:39:56,334][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:39:56,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:39:56,984][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:39:57,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:39:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:39:57,955][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:39:58,278][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:39:58,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:39:59,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:40:00,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:40:00,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:40:00,136][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:40:01,189][__main__][INFO] - Iteration 103 took 21s (32.23% Gen, 62.77% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 58m 31s. Estimated total time: 17h 34m 43s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 9s, 500 more iterations: 2h 55m 47s. [2025-11-13 08:40:01,191][__main__][INFO] - Starting iteration 103. [2025-11-13 08:40:01,194][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:40:01,195][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:40:08,027][__main__][INFO] - Number of regex retries in iteration 103: 0 [2025-11-13 08:40:08,027][__main__][INFO] - agents played in iteration 103 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:40:08,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:08,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:08,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:08,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:08,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:40:08,624][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:40:09,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:40:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:40:09,999][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:40:10,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:40:10,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:40:10,973][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:40:11,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:40:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:40:11,945][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:40:12,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:40:12,593][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:40:12,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:40:13,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:40:13,565][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:40:13,888][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:40:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:40:14,535][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:40:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:40:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:40:15,505][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:40:15,829][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:40:16,152][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:40:16,481][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:40:16,802][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:40:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:40:17,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:40:17,784][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:40:18,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:40:18,425][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:40:18,748][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:40:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:40:19,396][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:40:19,719][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:40:20,531][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:40:21,302][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:40:21,304][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:40:21,306][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:40:22,324][__main__][INFO] - Iteration 104 took 21s (32.34% Gen, 62.84% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 59m 58s. Estimated total time: 17h 36m 32s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 13s, 500 more iterations: 2h 56m 5s. [2025-11-13 08:40:22,326][__main__][INFO] - Starting iteration 104. [2025-11-13 08:40:22,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:40:22,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:40:29,165][__main__][INFO] - Number of regex retries in iteration 104: 0 [2025-11-13 08:40:29,166][__main__][INFO] - agents played in iteration 104 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:40:29,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:29,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:29,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:29,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:29,740][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:40:29,741][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:40:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:40:30,823][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:40:31,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:40:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:40:31,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:40:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:40:32,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:40:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:40:33,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:40:33,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:40:33,743][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:40:34,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:40:34,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:40:34,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:40:35,037][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:40:35,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:40:35,686][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:40:36,014][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:40:36,335][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:40:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:40:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:40:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:40:37,631][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:40:37,955][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:40:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:40:38,607][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:40:38,926][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:40:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:40:39,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:40:39,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:40:40,222][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:40:40,548][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:40:40,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:40:41,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:40:42,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:40:42,380][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:40:42,382][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:40:43,372][__main__][INFO] - Iteration 105 took 21s (32.49% Gen, 62.80% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 55m 18s. Estimated total time: 17h 32m 12s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 4s, 500 more iterations: 2h 55m 22s. [2025-11-13 08:40:43,374][__main__][INFO] - Starting iteration 105. [2025-11-13 08:40:43,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:40:43,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:40:50,162][__main__][INFO] - Number of regex retries in iteration 105: 0 [2025-11-13 08:40:50,163][__main__][INFO] - agents played in iteration 105 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:40:50,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:50,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:50,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:50,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:50,740][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:40:50,740][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:40:51,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:40:51,794][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:40:52,120][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:40:52,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:40:52,767][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:40:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:40:53,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:40:53,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:40:54,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:40:54,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:40:54,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:40:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:40:55,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:40:55,687][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:40:56,009][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:40:56,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:40:56,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:40:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:40:57,304][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:40:57,628][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:40:57,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:40:58,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:40:58,598][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:40:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:40:59,248][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:40:59,571][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:40:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:41:00,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:41:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:41:00,874][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:41:01,200][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:41:01,528][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:41:01,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:41:02,612][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:41:03,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:41:03,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:41:03,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:41:04,409][__main__][INFO] - Iteration 106 took 21s (32.26% Gen, 62.72% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 54m 22s. Estimated total time: 17h 31m 38s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 3s, 500 more iterations: 2h 55m 16s. [2025-11-13 08:41:04,412][__main__][INFO] - Starting iteration 106. [2025-11-13 08:41:04,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:41:04,416][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:41:11,154][__main__][INFO] - Number of regex retries in iteration 106: 0 [2025-11-13 08:41:11,154][__main__][INFO] - agents played in iteration 106 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:41:11,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:11,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:11,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:11,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:11,739][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:41:11,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:41:12,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:41:12,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:41:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:41:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:41:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:41:14,084][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:41:14,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:41:14,731][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:41:15,054][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:41:15,380][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:41:15,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:41:16,029][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:41:16,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:41:16,678][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:41:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:41:17,326][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:41:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:41:17,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:41:18,300][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:41:18,624][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:41:18,947][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:41:19,270][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:41:19,595][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:41:19,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:41:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:41:20,567][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:41:20,890][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:41:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:41:21,541][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:41:21,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:41:22,192][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:41:22,515][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:41:22,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:41:23,614][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:41:24,365][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:41:24,367][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:41:24,368][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:41:25,387][__main__][INFO] - Iteration 107 took 20s (32.13% Gen, 63.01% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 51m 2s. Estimated total time: 17h 28m 38s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 57s, 500 more iterations: 2h 54m 46s. [2025-11-13 08:41:25,389][__main__][INFO] - Starting iteration 107. [2025-11-13 08:41:25,393][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:41:25,393][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:41:32,229][__main__][INFO] - Number of regex retries in iteration 107: 0 [2025-11-13 08:41:32,230][__main__][INFO] - agents played in iteration 107 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:41:32,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:32,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:32,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:32,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:32,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:41:32,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:41:33,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:41:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:41:34,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:41:34,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:41:34,855][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:41:35,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:41:35,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:41:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:41:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:41:36,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:41:36,818][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:41:37,142][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:41:37,467][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:41:37,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:41:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:41:38,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:41:38,773][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:41:39,103][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:41:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:41:39,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:41:40,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:41:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:41:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:41:41,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:41:41,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:41:41,717][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:41:42,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:41:42,369][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:41:42,695][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:41:43,021][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:41:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:41:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:41:43,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:41:44,791][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:41:45,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:41:45,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:41:45,539][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:41:46,545][__main__][INFO] - Iteration 108 took 21s (32.31% Gen, 62.92% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 59m 42s. Estimated total time: 17h 37m 39s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 15s, 500 more iterations: 2h 56m 16s. [2025-11-13 08:41:46,548][__main__][INFO] - Starting iteration 108. [2025-11-13 08:41:46,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:41:46,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:41:53,445][__main__][INFO] - Number of regex retries in iteration 108: 0 [2025-11-13 08:41:53,446][__main__][INFO] - agents played in iteration 108 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:41:53,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:53,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:53,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:54,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:54,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:41:54,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:41:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:41:55,097][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:41:55,423][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:41:55,752][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:41:56,076][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:41:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:41:56,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:41:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:41:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:41:57,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:41:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:41:58,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:41:58,663][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:41:58,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:41:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:41:59,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:41:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:42:00,283][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:42:00,607][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:42:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:42:01,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:42:01,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:42:01,910][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:42:02,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:42:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:42:02,878][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:42:03,205][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:42:03,529][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:42:03,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:42:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:42:04,502][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:42:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:42:05,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:42:05,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:42:06,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:42:06,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:42:06,656][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:42:07,636][__main__][INFO] - Iteration 109 took 21s (32.70% Gen, 62.65% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 55m 57s. Estimated total time: 17h 34m 16s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 8s, 500 more iterations: 2h 55m 42s. [2025-11-13 08:42:07,638][__main__][INFO] - Starting iteration 109. [2025-11-13 08:42:07,641][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:42:07,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:42:14,351][__main__][INFO] - Number of regex retries in iteration 109: 0 [2025-11-13 08:42:14,352][__main__][INFO] - agents played in iteration 109 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:42:14,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:14,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:14,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:14,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:14,933][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:42:14,933][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:42:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:42:15,986][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:42:16,311][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:42:16,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:42:16,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:42:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:42:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:42:17,938][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:42:18,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:42:18,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:42:18,920][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:42:19,243][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:42:19,567][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:42:19,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:42:20,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:42:20,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:42:20,875][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:42:21,197][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:42:21,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:42:21,849][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:42:22,172][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:42:22,496][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:42:22,819][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:42:23,142][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:42:23,466][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:42:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:42:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:42:24,439][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:42:24,763][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:42:25,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:42:25,412][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:42:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:42:26,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:42:26,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:42:27,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:42:27,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:42:27,564][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:42:28,567][__main__][INFO] - Iteration 110 took 20s (32.07% Gen, 63.13% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 42s. Estimated total time: 17h 26m 22s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 52s, 500 more iterations: 2h 54m 23s. [2025-11-13 08:42:28,569][__main__][INFO] - Starting iteration 110. [2025-11-13 08:42:28,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-13 08:42:28,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:42:35,367][__main__][INFO] - Number of regex retries in iteration 110: 0 [2025-11-13 08:42:35,368][__main__][INFO] - agents played in iteration 110 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:42:35,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:35,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:35,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:35,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:35,947][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:42:35,947][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:42:36,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:42:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:42:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:42:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:42:37,975][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:42:38,298][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:42:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:42:38,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:42:39,269][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:42:39,592][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:42:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:42:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:42:40,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:42:40,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:42:41,216][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:42:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:42:41,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:42:42,187][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:42:42,510][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:42:42,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:42:43,156][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:42:43,479][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:42:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:42:44,128][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:42:44,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:42:44,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:42:45,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:42:45,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:42:45,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:42:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:42:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:42:46,720][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:42:47,044][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:42:47,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:42:48,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:42:48,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:42:48,562][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:42:50,508][__main__][INFO] - Iteration 111 took 21s (30.97% Gen, 60.15% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 37m 44s. Estimated total time: 18h 16m 46s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 33s, 500 more iterations: 3h 2m 47s. [2025-11-13 08:42:50,510][__main__][INFO] - Starting iteration 111. [2025-11-13 08:42:50,513][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:42:50,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:42:57,760][__main__][INFO] - Number of regex retries in iteration 111: 0 [2025-11-13 08:42:57,761][__main__][INFO] - agents played in iteration 111 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:42:58,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:58,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:58,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:58,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:58,350][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:42:58,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:42:59,104][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:42:59,399][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:42:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:43:00,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:43:00,371][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:43:00,696][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:43:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:43:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:43:01,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:43:02,000][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:43:02,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:43:02,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:43:02,984][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:43:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:43:03,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:43:03,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:43:04,276][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:43:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:43:04,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:43:05,245][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:43:05,569][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:43:05,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:43:06,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:43:06,538][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:43:06,862][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:43:07,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:43:07,509][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:43:07,831][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:43:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:43:08,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:43:08,802][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:43:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:43:09,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:43:10,209][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:43:10,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:43:10,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:43:10,972][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:43:11,953][__main__][INFO] - Iteration 112 took 21s (33.80% Gen, 61.61% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 12m 40s. Estimated total time: 17h 52m 3s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 44s, 500 more iterations: 2h 58m 40s. [2025-11-13 08:43:11,956][__main__][INFO] - Starting iteration 112. [2025-11-13 08:43:11,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:43:11,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:43:19,010][__main__][INFO] - Number of regex retries in iteration 112: 0 [2025-11-13 08:43:19,011][__main__][INFO] - agents played in iteration 112 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:43:19,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:19,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:19,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:19,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:19,593][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:43:19,593][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:43:20,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:43:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:43:20,981][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:43:21,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:43:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:43:21,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:43:22,276][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:43:22,599][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:43:22,922][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:43:23,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:43:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:43:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:43:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:43:24,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:43:24,873][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:43:25,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:43:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:43:25,845][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:43:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:43:26,491][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:43:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:43:27,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:43:27,462][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:43:27,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:43:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:43:28,434][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:43:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:43:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:43:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:43:29,730][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:43:30,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:43:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:43:30,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:43:31,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:43:32,203][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:43:32,205][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:43:32,206][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:43:33,222][__main__][INFO] - Iteration 113 took 21s (33.16% Gen, 62.06% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 3m 27s. Estimated total time: 17h 43m 11s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 26s, 500 more iterations: 2h 57m 11s. [2025-11-13 08:43:33,224][__main__][INFO] - Starting iteration 113. [2025-11-13 08:43:33,228][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:43:33,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:43:40,276][__main__][INFO] - Number of regex retries in iteration 113: 0 [2025-11-13 08:43:40,277][__main__][INFO] - agents played in iteration 113 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:43:40,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:40,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:40,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:40,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:40,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:43:40,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:43:41,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:43:41,924][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:43:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:43:42,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:43:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:43:43,222][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:43:43,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:43:43,871][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:43:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:43:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:43:44,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:43:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:43:45,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:43:45,819][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:43:46,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:43:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:43:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:43:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:43:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:43:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:43:48,092][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:43:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:43:48,745][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:43:49,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:43:49,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:43:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:43:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:43:50,368][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:43:50,692][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:43:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:43:51,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:43:51,662][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:43:51,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:43:52,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:43:53,521][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:43:53,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:43:53,524][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:43:54,512][__main__][INFO] - Iteration 114 took 21s (33.12% Gen, 62.23% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 4m 9s. Estimated total time: 17h 44m 14s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 28s, 500 more iterations: 2h 57m 22s. [2025-11-13 08:43:54,514][__main__][INFO] - Starting iteration 114. [2025-11-13 08:43:54,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:43:54,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:44:01,530][__main__][INFO] - Number of regex retries in iteration 114: 0 [2025-11-13 08:44:01,530][__main__][INFO] - agents played in iteration 114 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:44:02,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:02,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:02,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:02,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:02,105][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:44:02,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:44:02,943][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:44:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:44:03,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:44:03,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:44:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:44:04,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:44:04,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:44:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:44:05,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:44:05,838][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:44:06,163][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:44:06,488][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:44:06,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:44:07,134][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:44:07,458][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:44:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:44:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:44:08,430][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:44:08,754][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:44:09,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:44:09,402][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:44:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:44:10,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:44:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:44:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:44:11,023][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:44:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:44:11,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:44:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:44:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:44:12,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:44:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:44:13,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:44:14,055][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:44:14,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:44:14,803][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:44:14,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:44:15,790][__main__][INFO] - Iteration 115 took 21s (32.96% Gen, 62.40% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 3m 15s. Estimated total time: 17h 43m 41s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 27s, 500 more iterations: 2h 57m 16s. [2025-11-13 08:44:15,792][__main__][INFO] - Starting iteration 115. [2025-11-13 08:44:15,795][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:44:15,796][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:44:22,895][__main__][INFO] - Number of regex retries in iteration 115: 0 [2025-11-13 08:44:22,896][__main__][INFO] - agents played in iteration 115 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:44:23,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:23,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:23,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:23,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:23,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:44:23,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:44:24,258][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:44:24,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:44:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:44:25,202][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:44:25,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:44:25,848][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:44:26,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:44:26,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:44:26,818][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:44:27,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:44:27,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:44:27,792][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:44:28,115][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:44:28,438][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:44:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:44:29,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:44:29,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:44:29,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:44:30,061][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:44:30,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:44:30,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:44:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:44:31,352][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:44:31,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:44:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:44:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:44:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:44:32,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:44:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:44:33,621][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:44:33,945][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:44:34,271][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:44:34,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:44:35,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:44:36,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:44:36,100][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:44:36,101][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:44:37,089][__main__][INFO] - Iteration 116 took 21s (33.34% Gen, 62.01% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 3m 55s. Estimated total time: 17h 44m 43s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 29s, 500 more iterations: 2h 57m 27s. [2025-11-13 08:44:37,091][__main__][INFO] - Starting iteration 116. [2025-11-13 08:44:37,094][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:44:37,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:44:44,151][__main__][INFO] - Number of regex retries in iteration 116: 0 [2025-11-13 08:44:44,151][__main__][INFO] - agents played in iteration 116 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:44:44,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,746][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:44:44,746][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:44:45,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:44:45,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:44:46,145][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:44:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:44:46,792][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:44:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:44:47,439][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:44:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:44:48,086][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:44:48,410][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:44:48,737][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:44:49,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:44:49,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:44:49,708][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:44:50,038][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:44:50,362][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:44:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:44:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:44:51,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:44:51,666][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:44:51,990][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:44:52,313][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:44:52,644][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:44:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:44:53,292][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:44:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:44:53,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:44:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:44:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:44:54,917][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:44:55,243][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:44:55,571][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:44:55,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:44:56,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:44:57,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:44:57,409][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:44:57,411][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:44:58,408][__main__][INFO] - Iteration 117 took 21s (33.11% Gen, 62.21% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 4m 33s. Estimated total time: 17h 45m 42s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 31s, 500 more iterations: 2h 57m 37s. [2025-11-13 08:44:58,410][__main__][INFO] - Starting iteration 117. [2025-11-13 08:44:58,414][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:44:58,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:45:05,456][__main__][INFO] - Number of regex retries in iteration 117: 0 [2025-11-13 08:45:05,457][__main__][INFO] - agents played in iteration 117 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:45:05,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:05,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:06,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:06,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:06,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:45:06,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:45:06,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:45:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:45:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:45:07,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:45:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:45:08,429][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:45:08,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:45:09,077][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:45:09,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:45:09,730][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:45:10,054][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:45:10,377][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:45:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:45:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:45:11,355][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:45:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:45:12,010][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:45:12,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:45:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:45:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:45:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:45:13,641][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:45:13,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:45:14,288][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:45:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:45:14,942][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:45:15,267][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:45:15,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:45:15,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:45:16,254][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:45:16,580][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:45:16,906][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:45:17,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:45:18,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:45:18,773][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:45:18,774][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:45:18,776][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:45:19,804][__main__][INFO] - Iteration 118 took 21s (32.92% Gen, 62.27% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 8m 1s. Estimated total time: 17h 49m 32s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 39s, 500 more iterations: 2h 58m 15s. [2025-11-13 08:45:19,806][__main__][INFO] - Starting iteration 118. [2025-11-13 08:45:19,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:45:19,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:45:26,844][__main__][INFO] - Number of regex retries in iteration 118: 0 [2025-11-13 08:45:26,845][__main__][INFO] - agents played in iteration 118 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:45:27,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:27,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:27,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:27,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:27,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:45:27,416][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:45:28,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:45:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:45:28,780][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:45:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:45:29,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:45:29,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:45:30,083][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:45:30,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:45:30,731][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:45:31,055][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:45:31,380][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:45:31,704][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:45:32,029][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:45:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:45:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:45:33,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:45:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:45:33,649][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:45:33,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:45:34,300][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:45:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:45:34,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:45:35,271][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:45:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:45:35,919][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:45:36,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:45:36,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:45:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:45:37,218][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:45:37,541][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:45:37,866][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:45:38,190][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:45:38,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:45:39,284][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:45:40,034][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:45:40,036][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:45:40,038][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:45:41,025][__main__][INFO] - Iteration 119 took 21s (33.16% Gen, 62.18% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 58m 58s. Estimated total time: 17h 40m 50s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 21s, 500 more iterations: 2h 56m 48s. [2025-11-13 08:45:41,027][__main__][INFO] - Starting iteration 119. [2025-11-13 08:45:41,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:45:41,030][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:45:47,967][__main__][INFO] - Number of regex retries in iteration 119: 0 [2025-11-13 08:45:47,968][__main__][INFO] - agents played in iteration 119 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:45:48,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:48,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:48,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:48,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:48,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:45:48,539][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:45:49,325][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:45:49,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:45:49,946][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:45:50,275][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:45:50,603][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:45:50,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:45:51,253][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:45:51,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:45:51,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:45:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:45:52,557][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:45:52,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:45:53,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:45:53,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:45:53,870][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:45:54,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:45:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:45:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:45:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:45:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:45:55,808][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:45:56,132][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:45:56,457][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:45:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:45:57,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:45:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:45:57,756][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:45:58,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:45:58,404][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:45:58,727][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:45:59,051][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:45:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:45:59,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:46:00,451][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:46:01,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:46:01,195][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:46:01,197][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:46:02,175][__main__][INFO] - Iteration 120 took 21s (32.80% Gen, 62.56% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 55m 5s. Estimated total time: 17h 37m 18s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 14s, 500 more iterations: 2h 56m 13s. [2025-11-13 08:46:02,177][__main__][INFO] - Starting iteration 120. [2025-11-13 08:46:02,182][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. [2025-11-13 08:46:02,182][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:46:09,291][__main__][INFO] - Number of regex retries in iteration 120: 0 [2025-11-13 08:46:09,291][__main__][INFO] - agents played in iteration 120 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:46:09,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:09,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:09,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:09,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:09,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:46:09,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:46:10,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:46:10,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:46:11,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:46:11,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:46:11,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:46:12,223][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:46:12,548][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:46:12,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:46:13,195][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:46:13,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:46:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:46:14,167][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:46:14,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:46:14,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:46:15,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:46:15,462][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:46:15,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:46:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:46:16,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:46:16,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:46:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:46:17,408][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:46:17,730][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:46:18,056][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:46:18,381][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:46:18,705][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:46:19,029][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:46:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:46:19,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:46:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:46:20,323][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:46:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:46:20,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:46:21,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:46:22,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:46:22,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:46:22,486][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:46:24,416][__main__][INFO] - Iteration 121 took 22s (31.97% Gen, 59.34% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 49m 11s. Estimated total time: 18h 31m 47s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 3s, 500 more iterations: 3h 5m 17s. [2025-11-13 08:46:24,418][__main__][INFO] - Starting iteration 121. [2025-11-13 08:46:24,421][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:46:24,422][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:46:31,948][__main__][INFO] - Number of regex retries in iteration 121: 0 [2025-11-13 08:46:31,949][__main__][INFO] - agents played in iteration 121 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:46:32,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:32,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:32,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:32,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:32,531][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:46:32,532][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:46:33,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:46:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:46:33,893][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:46:34,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:46:34,550][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:46:34,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:46:35,197][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:46:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:46:35,849][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:46:36,174][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:46:36,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:46:36,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:46:37,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:46:37,479][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:46:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:46:38,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:46:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:46:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:46:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:46:39,433][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:46:39,757][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:46:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:46:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:46:40,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:46:41,068][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:46:41,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:46:41,726][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:46:42,053][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:46:42,382][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:46:42,706][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:46:43,030][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:46:43,357][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:46:43,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:46:44,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:46:45,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:46:45,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:46:45,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:46:46,229][__main__][INFO] - Iteration 122 took 21s (34.51% Gen, 60.80% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 29s. Estimated total time: 18h 10m 26s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 20s, 500 more iterations: 3h 1m 44s. [2025-11-13 08:46:46,231][__main__][INFO] - Starting iteration 122. [2025-11-13 08:46:46,234][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:46:46,235][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:46:53,580][__main__][INFO] - Number of regex retries in iteration 122: 0 [2025-11-13 08:46:53,581][__main__][INFO] - agents played in iteration 122 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:46:54,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:54,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:54,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:54,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:54,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:46:54,151][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:46:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:46:55,231][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:46:55,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:46:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:46:56,208][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:46:56,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:46:56,864][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:46:57,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:46:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:46:57,835][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:46:58,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:46:58,484][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:46:58,814][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:46:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:46:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:46:59,793][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:47:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:47:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:47:00,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:47:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:47:01,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:47:01,749][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:47:02,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:47:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:47:02,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:47:03,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:47:03,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:47:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:47:04,028][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:47:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:47:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:47:04,998][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:47:05,324][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:47:06,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:47:06,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:47:06,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:47:06,869][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:47:07,897][__main__][INFO] - Iteration 123 took 21s (33.91% Gen, 61.34% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 19m 53s. Estimated total time: 18h 3m 12s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 6s, 500 more iterations: 3h 0m 32s. [2025-11-13 08:47:07,899][__main__][INFO] - Starting iteration 123. [2025-11-13 08:47:07,903][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:47:07,904][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:47:15,282][__main__][INFO] - Number of regex retries in iteration 123: 0 [2025-11-13 08:47:15,283][__main__][INFO] - agents played in iteration 123 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:47:15,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:15,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:15,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:15,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:15,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:47:15,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:47:16,617][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:47:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:47:17,239][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:47:17,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:47:17,894][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:47:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:47:18,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:47:18,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:47:19,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:47:19,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:47:19,853][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:47:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:47:20,499][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:47:20,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:47:21,152][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:47:21,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:47:21,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:47:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:47:22,466][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:47:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:47:23,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:47:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:47:23,763][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:47:24,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:47:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:47:24,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:47:25,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:47:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:47:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:47:26,029][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:47:26,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:47:26,677][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:47:27,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:47:27,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:47:28,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:47:28,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:47:28,553][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:47:29,522][__main__][INFO] - Iteration 124 took 21s (34.13% Gen, 61.38% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 17m 17s. Estimated total time: 18h 0m 58s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 1s, 500 more iterations: 3h 0m 9s. [2025-11-13 08:47:29,524][__main__][INFO] - Starting iteration 124. [2025-11-13 08:47:29,527][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:47:29,527][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:47:36,903][__main__][INFO] - Number of regex retries in iteration 124: 0 [2025-11-13 08:47:36,904][__main__][INFO] - agents played in iteration 124 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:47:37,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:37,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:37,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:37,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:37,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:47:37,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:47:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:47:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:47:38,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:47:39,196][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:47:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:47:39,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:47:40,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:47:40,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:47:40,818][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:47:41,143][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:47:41,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:47:41,792][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:47:42,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:47:42,440][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:47:42,764][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:47:43,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:47:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:47:43,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:47:44,060][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:47:44,384][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:47:44,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:47:45,033][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:47:45,357][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:47:45,680][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:47:46,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:47:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:47:46,656][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:47:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:47:47,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:47:47,628][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:47:47,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:47:48,279][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:47:48,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:47:49,390][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:47:50,170][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:47:50,172][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:47:50,174][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:47:51,166][__main__][INFO] - Iteration 125 took 21s (34.08% Gen, 61.32% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 17m 58s. Estimated total time: 18h 2m 0s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 4s, 500 more iterations: 3h 0m 20s. [2025-11-13 08:47:51,168][__main__][INFO] - Starting iteration 125. [2025-11-13 08:47:51,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:47:51,171][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:47:58,389][__main__][INFO] - Number of regex retries in iteration 125: 0 [2025-11-13 08:47:58,389][__main__][INFO] - agents played in iteration 125 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:47:58,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:58,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:58,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:58,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:58,958][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:47:58,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:47:59,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:48:00,013][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:48:00,328][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:48:00,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:48:00,979][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:48:01,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:48:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:48:01,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:48:02,274][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:48:02,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:48:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:48:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:48:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:48:03,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:48:04,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:48:04,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:48:04,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:48:05,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:48:05,519][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:48:05,849][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:48:06,175][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:48:06,503][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:48:06,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:48:07,156][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:48:07,479][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:48:07,804][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:48:08,129][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:48:08,457][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:48:08,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:48:09,110][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:48:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:48:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:48:10,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:48:10,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:48:11,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:48:11,591][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:48:11,592][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:48:12,614][__main__][INFO] - Iteration 126 took 21s (33.66% Gen, 61.57% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 48s. Estimated total time: 17h 52m 11s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 44s, 500 more iterations: 2h 58m 41s. [2025-11-13 08:48:12,616][__main__][INFO] - Starting iteration 126. [2025-11-13 08:48:12,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:48:12,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:48:20,006][__main__][INFO] - Number of regex retries in iteration 126: 0 [2025-11-13 08:48:20,006][__main__][INFO] - agents played in iteration 126 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:48:20,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:20,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:20,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:20,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:20,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:48:20,583][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:48:21,364][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:48:21,659][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:48:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:48:22,315][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:48:22,641][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:48:22,964][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:48:23,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:48:23,623][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:48:23,945][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:48:24,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:48:24,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:48:24,924][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:48:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:48:25,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:48:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:48:26,233][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:48:26,553][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:48:26,878][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:48:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:48:27,532][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:48:27,855][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:48:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:48:28,501][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:48:28,824][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:48:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:48:29,476][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:48:29,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:48:30,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:48:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:48:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:48:31,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:48:31,436][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:48:31,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:48:32,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:48:33,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:48:33,323][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:48:33,324][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:48:34,328][__main__][INFO] - Iteration 127 took 21s (34.02% Gen, 61.35% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 42s. Estimated total time: 18h 5m 28s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 10s, 500 more iterations: 3h 0m 54s. [2025-11-13 08:48:34,330][__main__][INFO] - Starting iteration 127. [2025-11-13 08:48:34,332][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:48:34,333][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:48:41,693][__main__][INFO] - Number of regex retries in iteration 127: 0 [2025-11-13 08:48:41,694][__main__][INFO] - agents played in iteration 127 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:48:42,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:42,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:42,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:42,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:48:42,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:48:42,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:48:43,017][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:48:43,319][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:48:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:48:43,959][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:48:44,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:48:44,607][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:48:44,932][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:48:45,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:48:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:48:45,906][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:48:46,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:48:46,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:48:46,878][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:48:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:48:47,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:48:47,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:48:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:48:48,502][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:48:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:48:49,156][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:48:49,483][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:48:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:48:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:48:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:48:50,781][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:48:51,110][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:48:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:48:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:48:52,087][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:48:52,412][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:48:52,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:48:53,069][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:48:53,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:48:54,173][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:48:54,936][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:48:54,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:48:54,939][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:48:55,939][__main__][INFO] - Iteration 128 took 21s (34.07% Gen, 61.30% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 15s. Estimated total time: 18h 0m 23s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 0s, 500 more iterations: 3h 0m 3s. [2025-11-13 08:48:55,941][__main__][INFO] - Starting iteration 128. [2025-11-13 08:48:55,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:48:55,946][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:49:03,381][__main__][INFO] - Number of regex retries in iteration 128: 0 [2025-11-13 08:49:03,381][__main__][INFO] - agents played in iteration 128 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:49:03,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:03,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:03,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:03,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:03,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:49:03,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:49:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:49:04,990][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:49:05,315][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:49:05,640][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:49:05,967][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:49:06,291][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:49:06,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:49:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:49:07,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:49:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:49:07,917][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:49:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:49:08,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:49:08,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:49:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:49:09,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:49:09,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:49:10,192][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:49:10,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:49:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:49:11,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:49:11,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:49:11,816][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:49:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:49:12,464][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:49:12,788][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:49:13,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:49:13,438][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:49:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:49:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:49:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:49:14,735][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:49:15,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:49:15,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:49:16,583][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:49:16,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:49:16,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:49:17,571][__main__][INFO] - Iteration 129 took 21s (34.38% Gen, 61.05% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 55s. Estimated total time: 18h 1m 24s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 2s, 500 more iterations: 3h 0m 14s. [2025-11-13 08:49:17,573][__main__][INFO] - Starting iteration 129. [2025-11-13 08:49:17,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:49:17,577][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:49:24,915][__main__][INFO] - Number of regex retries in iteration 129: 0 [2025-11-13 08:49:24,915][__main__][INFO] - agents played in iteration 129 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:49:25,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:25,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:25,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:25,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:25,496][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:49:25,497][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:49:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:49:26,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:49:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:49:27,206][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:49:27,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:49:27,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:49:28,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:49:28,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:49:28,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:49:29,164][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:49:29,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:49:29,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:49:30,136][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:49:30,464][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:49:30,791][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:49:31,116][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:49:31,445][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:49:31,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:49:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:49:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:49:32,751][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:49:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:49:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:49:33,726][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:49:34,050][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:49:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:49:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:49:35,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:49:35,349][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:49:35,675][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:49:36,005][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:49:36,334][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:49:36,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:49:37,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:49:38,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:49:38,153][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:49:38,155][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:49:39,135][__main__][INFO] - Iteration 130 took 21s (34.04% Gen, 61.41% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 12m 7s. Estimated total time: 17h 57m 57s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 55s, 500 more iterations: 2h 59m 39s. [2025-11-13 08:49:39,137][__main__][INFO] - Starting iteration 130. [2025-11-13 08:49:39,140][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. [2025-11-13 08:49:39,141][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:49:46,408][__main__][INFO] - Number of regex retries in iteration 130: 0 [2025-11-13 08:49:46,409][__main__][INFO] - agents played in iteration 130 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:49:46,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:46,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:46,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:46,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:46,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:49:46,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:49:47,721][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:49:48,016][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:49:48,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:49:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:49:48,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:49:49,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:49:49,644][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:49:49,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:49:50,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:49:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:49:50,946][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:49:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:49:51,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:49:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:49:52,244][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:49:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:49:52,893][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:49:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:49:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:49:53,870][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:49:54,194][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:49:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:49:54,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:49:55,168][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:49:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:49:55,821][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:49:56,145][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:49:56,469][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:49:56,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:49:57,116][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:49:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:49:57,761][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:49:58,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:49:58,829][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:49:59,563][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:49:59,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:49:59,569][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:50:01,509][__main__][INFO] - Iteration 131 took 22s (32.49% Gen, 58.83% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 52m 16s. Estimated total time: 18h 38m 29s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 16s, 500 more iterations: 3h 6m 24s. [2025-11-13 08:50:01,511][__main__][INFO] - Starting iteration 131. [2025-11-13 08:50:01,514][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:50:01,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:50:08,903][__main__][INFO] - Number of regex retries in iteration 131: 0 [2025-11-13 08:50:08,904][__main__][INFO] - agents played in iteration 131 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:50:09,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:09,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:09,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:09,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:09,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:50:09,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:50:10,255][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:50:10,550][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:50:10,875][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:50:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:50:11,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:50:11,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:50:12,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:50:12,512][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:50:12,836][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:50:13,165][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:50:13,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:50:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:50:14,141][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:50:14,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:50:14,794][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:50:15,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:50:15,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:50:15,772][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:50:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:50:16,424][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:50:16,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:50:17,073][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:50:17,398][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:50:17,722][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:50:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:50:18,372][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:50:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:50:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:50:19,343][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:50:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:50:19,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:50:20,318][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:50:20,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:50:21,388][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:50:22,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:50:22,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:50:22,145][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:50:23,155][__main__][INFO] - Iteration 132 took 21s (34.14% Gen, 61.19% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 30s. Estimated total time: 18h 2m 4s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 4s, 500 more iterations: 3h 0m 20s. [2025-11-13 08:50:23,157][__main__][INFO] - Starting iteration 132. [2025-11-13 08:50:23,160][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:50:23,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:50:31,083][__main__][INFO] - Number of regex retries in iteration 132: 0 [2025-11-13 08:50:31,083][__main__][INFO] - agents played in iteration 132 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:50:31,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:31,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:31,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:31,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:31,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:50:31,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:50:32,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:50:32,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:50:33,050][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:50:33,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:50:33,699][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:50:34,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:50:34,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:50:34,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:50:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:50:35,322][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:50:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:50:35,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:50:36,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:50:36,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:50:36,944][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:50:37,267][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:50:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:50:37,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:50:38,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:50:38,572][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:50:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:50:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:50:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:50:39,872][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:50:40,195][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:50:40,519][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:50:40,843][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:50:41,167][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:50:41,491][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:50:41,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:50:42,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:50:42,462][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:50:42,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:50:43,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:50:44,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:50:44,328][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:50:44,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:50:45,310][__main__][INFO] - Iteration 133 took 22s (35.77% Gen, 59.80% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 40m 36s. Estimated total time: 18h 27m 33s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 55s, 500 more iterations: 3h 4m 35s. [2025-11-13 08:50:45,316][__main__][INFO] - Starting iteration 133. [2025-11-13 08:50:45,318][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:50:45,319][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:50:53,021][__main__][INFO] - Number of regex retries in iteration 133: 0 [2025-11-13 08:50:53,022][__main__][INFO] - agents played in iteration 133 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:50:53,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:53,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:53,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:53,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:53,600][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:50:53,600][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:50:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:50:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:50:54,987][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:50:55,311][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:50:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:50:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:50:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:50:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:50:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:50:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:50:57,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:50:57,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:50:58,226][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:50:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:50:58,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:50:59,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:50:59,519][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:50:59,847][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:51:00,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:51:00,489][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:51:00,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:51:01,135][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:51:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:51:01,781][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:51:02,105][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:51:02,429][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:51:02,757][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:51:03,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:51:03,404][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:51:03,737][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:51:04,054][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:51:04,377][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:51:04,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:51:05,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:51:06,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:51:06,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:51:06,201][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:51:07,184][__main__][INFO] - Iteration 134 took 21s (35.22% Gen, 60.27% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 26m 1s. Estimated total time: 18h 13m 20s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 26s, 500 more iterations: 3h 2m 13s. [2025-11-13 08:51:07,187][__main__][INFO] - Starting iteration 134. [2025-11-13 08:51:07,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:51:07,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:51:14,663][__main__][INFO] - Number of regex retries in iteration 134: 0 [2025-11-13 08:51:14,664][__main__][INFO] - agents played in iteration 134 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:51:15,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:15,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:15,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:15,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:15,236][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:51:15,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:51:16,007][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:51:16,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:51:16,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:51:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:51:17,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:51:17,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:51:17,927][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:51:18,252][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:51:18,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:51:18,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:51:19,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:51:19,561][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:51:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:51:20,205][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:51:20,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:51:20,860][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:51:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:51:21,504][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:51:21,828][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:51:22,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:51:22,475][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:51:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:51:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:51:23,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:51:23,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:51:24,098][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:51:24,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:51:24,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:51:25,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:51:25,398][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:51:25,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:51:26,050][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:51:26,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:51:27,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:51:27,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:51:27,905][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:51:27,907][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:51:28,916][__main__][INFO] - Iteration 135 took 21s (34.40% Gen, 60.95% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 41s. Estimated total time: 18h 6m 21s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 12s, 500 more iterations: 3h 1m 3s. [2025-11-13 08:51:28,919][__main__][INFO] - Starting iteration 135. [2025-11-13 08:51:28,923][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:51:28,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:51:36,465][__main__][INFO] - Number of regex retries in iteration 135: 0 [2025-11-13 08:51:36,465][__main__][INFO] - agents played in iteration 135 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:51:36,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:36,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:37,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:37,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:37,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:51:37,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:51:37,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:51:38,120][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:51:38,444][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:51:38,768][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:51:39,092][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:51:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:51:39,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:51:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:51:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:51:40,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:51:41,037][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:51:41,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:51:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:51:42,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:51:42,335][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:51:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:51:42,982][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:51:43,305][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:51:43,628][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:51:43,952][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:51:44,275][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:51:44,600][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:51:44,923][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:51:45,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:51:45,571][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:51:45,894][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:51:46,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:51:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:51:46,865][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:51:47,188][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:51:47,511][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:51:47,834][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:51:48,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:51:48,924][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:51:49,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:51:49,672][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:51:49,674][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:51:50,690][__main__][INFO] - Iteration 136 took 21s (34.64% Gen, 60.68% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 23s. Estimated total time: 18h 8m 24s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 16s, 500 more iterations: 3h 1m 24s. [2025-11-13 08:51:50,693][__main__][INFO] - Starting iteration 136. [2025-11-13 08:51:50,696][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:51:50,696][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:51:58,354][__main__][INFO] - Number of regex retries in iteration 136: 0 [2025-11-13 08:51:58,355][__main__][INFO] - agents played in iteration 136 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:51:58,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:58,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:58,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:58,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:58,951][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:51:58,952][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:51:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:52:00,013][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:52:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:52:00,662][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:52:00,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:52:01,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:52:01,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:52:01,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:52:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:52:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:52:02,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:52:03,259][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:52:03,582][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:52:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:52:04,230][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:52:04,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:52:04,877][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:52:05,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:52:05,524][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:52:05,848][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:52:06,171][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:52:06,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:52:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:52:07,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:52:07,467][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:52:07,792][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:52:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:52:08,445][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:52:08,777][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:52:09,099][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:52:09,424][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:52:09,749][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:52:10,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:52:10,841][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:52:11,611][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:52:11,613][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:52:11,615][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:52:12,644][__main__][INFO] - Iteration 137 took 21s (34.89% Gen, 60.41% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 29m 4s. Estimated total time: 18h 17m 28s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 34s, 500 more iterations: 3h 2m 54s. [2025-11-13 08:52:12,649][__main__][INFO] - Starting iteration 137. [2025-11-13 08:52:12,652][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:52:12,653][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:52:20,450][__main__][INFO] - Number of regex retries in iteration 137: 0 [2025-11-13 08:52:20,451][__main__][INFO] - agents played in iteration 137 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:52:20,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:20,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:20,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:21,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:21,032][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:52:21,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:52:21,817][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:52:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:52:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:52:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:52:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:52:23,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:52:23,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:52:24,064][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:52:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:52:24,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:52:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:52:25,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:52:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:52:26,008][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:52:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:52:26,656][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:52:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:52:27,306][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:52:27,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:52:27,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:52:28,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:52:28,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:52:28,926][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:52:29,248][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:52:29,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:52:29,897][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:52:30,220][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:52:30,545][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:52:30,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:52:31,195][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:52:31,519][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:52:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:52:32,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:52:32,937][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:52:33,711][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:52:33,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:52:33,714][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:52:34,707][__main__][INFO] - Iteration 138 took 22s (35.36% Gen, 60.14% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 33m 59s. Estimated total time: 18h 22m 45s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 45s, 500 more iterations: 3h 3m 47s. [2025-11-13 08:52:34,709][__main__][INFO] - Starting iteration 138. [2025-11-13 08:52:34,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:52:34,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:52:42,518][__main__][INFO] - Number of regex retries in iteration 138: 0 [2025-11-13 08:52:42,519][__main__][INFO] - agents played in iteration 138 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:52:43,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:43,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:43,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:43,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:43,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:52:43,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:52:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:52:44,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:52:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:52:44,803][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:52:45,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:52:45,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:52:45,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:52:46,103][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:52:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:52:46,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:52:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:52:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:52:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:52:48,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:52:48,394][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:52:48,722][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:52:49,046][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:52:49,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:52:49,697][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:52:50,021][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:52:50,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:52:50,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:52:50,994][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:52:51,317][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:52:51,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:52:51,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:52:52,289][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:52:52,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:52:52,935][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:52:53,259][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:52:53,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:52:53,907][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:52:54,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:52:55,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:52:55,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:52:55,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:52:55,761][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:52:56,724][__main__][INFO] - Iteration 139 took 22s (35.46% Gen, 60.16% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 31m 30s. Estimated total time: 18h 20m 38s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 41s, 500 more iterations: 3h 3m 26s. [2025-11-13 08:52:56,725][__main__][INFO] - Starting iteration 139. [2025-11-13 08:52:56,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:52:56,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:53:04,014][__main__][INFO] - Number of regex retries in iteration 139: 0 [2025-11-13 08:53:04,015][__main__][INFO] - agents played in iteration 139 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:53:04,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:04,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:04,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:04,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:04,587][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:53:04,587][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:53:05,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:53:05,649][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:53:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:53:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:53:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:53:06,946][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:53:07,275][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:53:07,600][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:53:07,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:53:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:53:08,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:53:08,894][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:53:09,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:53:09,545][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:53:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:53:10,192][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:53:10,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:53:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:53:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:53:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:53:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:53:12,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:53:12,474][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:53:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:53:13,122][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:53:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:53:13,775][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:53:14,098][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:53:14,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:53:14,751][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:53:15,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:53:15,404][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:53:15,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:53:16,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:53:17,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:53:17,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:53:17,259][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:53:18,257][__main__][INFO] - Iteration 140 took 21s (33.84% Gen, 61.51% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 0s. Estimated total time: 17h 56m 30s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 53s, 500 more iterations: 2h 59m 25s. [2025-11-13 08:53:18,260][__main__][INFO] - Starting iteration 140. [2025-11-13 08:53:18,265][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. [2025-11-13 08:53:18,266][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:53:25,890][__main__][INFO] - Number of regex retries in iteration 140: 0 [2025-11-13 08:53:25,890][__main__][INFO] - agents played in iteration 140 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:53:26,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:26,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:26,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:26,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:26,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:53:26,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:53:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:53:27,550][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:53:27,875][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:53:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:53:28,524][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:53:28,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:53:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:53:29,505][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:53:29,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:53:30,161][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:53:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:53:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:53:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:53:31,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:53:31,784][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:53:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:53:32,430][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:53:32,755][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:53:33,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:53:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:53:33,727][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:53:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:53:34,377][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:53:34,700][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:53:35,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:53:35,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:53:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:53:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:53:36,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:53:36,653][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:53:36,978][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:53:37,303][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:53:37,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:53:38,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:53:39,203][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:53:39,204][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:53:39,206][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:53:41,218][__main__][INFO] - Iteration 141 took 22s (33.21% Gen, 58.01% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 17m 50s. Estimated total time: 19h 7m 42s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 17s. [2025-11-13 08:53:41,220][__main__][INFO] - Starting iteration 141. [2025-11-13 08:53:41,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:53:41,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:53:49,280][__main__][INFO] - Number of regex retries in iteration 141: 0 [2025-11-13 08:53:49,280][__main__][INFO] - agents played in iteration 141 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:53:49,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:49,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:49,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:49,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:49,880][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:53:49,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:53:50,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:53:50,933][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:53:51,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:53:51,584][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:53:51,908][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:53:52,232][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:53:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:53:52,881][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:53:53,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:53:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:53:53,856][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:53:54,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:53:54,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:53:54,827][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:53:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:53:55,485][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:53:55,810][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:53:56,135][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:53:56,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:53:56,791][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:53:57,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:53:57,439][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:53:57,771][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:53:58,095][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:53:58,419][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:53:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:53:59,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:53:59,390][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:53:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:54:00,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:54:00,370][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:54:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:54:01,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:54:01,736][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:54:02,478][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:54:02,480][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:54:02,482][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:54:03,482][__main__][INFO] - Iteration 142 took 22s (36.19% Gen, 59.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 42m 43s. Estimated total time: 18h 32m 58s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 5s, 500 more iterations: 3h 5m 29s. [2025-11-13 08:54:03,484][__main__][INFO] - Starting iteration 142. [2025-11-13 08:54:03,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:54:03,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:54:11,585][__main__][INFO] - Number of regex retries in iteration 142: 0 [2025-11-13 08:54:11,586][__main__][INFO] - agents played in iteration 142 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:54:12,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:12,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:12,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:12,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:12,171][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:54:12,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:54:12,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:54:13,221][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:54:13,547][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:54:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:54:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:54:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:54:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:54:15,182][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:54:15,505][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:54:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:54:16,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:54:16,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:54:16,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:54:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:54:17,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:54:17,794][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:54:18,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:54:18,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:54:18,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:54:19,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:54:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:54:19,748][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:54:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:54:20,406][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:54:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:54:21,067][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:54:21,391][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:54:21,714][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:54:22,038][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:54:22,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:54:22,685][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:54:23,009][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:54:23,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:54:24,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:54:24,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:54:24,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:54:24,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:54:25,811][__main__][INFO] - Iteration 143 took 22s (36.28% Gen, 59.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 45m 36s. Estimated total time: 18h 36m 13s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 12s, 500 more iterations: 3h 6m 2s. [2025-11-13 08:54:25,813][__main__][INFO] - Starting iteration 143. [2025-11-13 08:54:25,816][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:54:25,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:54:33,621][__main__][INFO] - Number of regex retries in iteration 143: 0 [2025-11-13 08:54:33,621][__main__][INFO] - agents played in iteration 143 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:54:34,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:34,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:34,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:34,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:34,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:54:34,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:54:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:54:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:54:35,597][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:54:35,921][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:54:36,244][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:54:36,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:54:36,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:54:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:54:37,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:54:37,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:54:38,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:54:38,523][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:54:38,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:54:39,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:54:39,495][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:54:39,819][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:54:40,147][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:54:40,473][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:54:40,797][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:54:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:54:41,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:54:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:54:42,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:54:42,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:54:42,751][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:54:43,078][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:54:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:54:43,726][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:54:44,049][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:54:44,373][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:54:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:54:45,026][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:54:45,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:54:46,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:54:46,871][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:54:46,873][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:54:46,875][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:54:47,912][__main__][INFO] - Iteration 144 took 22s (35.32% Gen, 59.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 33m 51s. Estimated total time: 18h 24m 50s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 49s, 500 more iterations: 3h 4m 8s. [2025-11-13 08:54:47,914][__main__][INFO] - Starting iteration 144. [2025-11-13 08:54:47,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:54:47,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:54:56,053][__main__][INFO] - Number of regex retries in iteration 144: 0 [2025-11-13 08:54:56,054][__main__][INFO] - agents played in iteration 144 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:54:56,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:56,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:56,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:56,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:56,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:54:56,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:54:57,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:54:57,693][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:54:58,018][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:54:58,343][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:54:58,671][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:54:58,997][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:54:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:54:59,649][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:54:59,974][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:55:00,303][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:55:00,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:55:00,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:55:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:55:01,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:55:01,932][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:55:02,256][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:55:02,583][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:55:02,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:55:03,233][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:55:03,556][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:55:03,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:55:04,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:55:04,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:55:04,858][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:55:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:55:05,511][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:55:05,842][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:55:06,168][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:55:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:55:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:55:07,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:55:07,471][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:55:07,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:55:08,554][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:55:09,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:55:09,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:55:09,317][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:55:10,324][__main__][INFO] - Iteration 145 took 22s (36.31% Gen, 59.19% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 49m 1s. Estimated total time: 18h 40m 23s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 20s, 500 more iterations: 3h 6m 43s. [2025-11-13 08:55:10,326][__main__][INFO] - Starting iteration 145. [2025-11-13 08:55:10,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:55:10,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:55:18,348][__main__][INFO] - Number of regex retries in iteration 145: 0 [2025-11-13 08:55:18,348][__main__][INFO] - agents played in iteration 145 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:55:18,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:18,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:18,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:18,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:18,935][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:55:18,936][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:55:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:55:19,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:55:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:55:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:55:20,937][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:55:21,261][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:55:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:55:21,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:55:22,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:55:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:55:22,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:55:23,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:55:23,540][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:55:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:55:24,193][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:55:24,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:55:24,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:55:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:55:25,499][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:55:25,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:55:26,148][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:55:26,473][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:55:26,800][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:55:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:55:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:55:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:55:28,098][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:55:28,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:55:28,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:55:29,079][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:55:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:55:29,731][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:55:30,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:55:30,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:55:31,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:55:31,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:55:31,571][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:55:32,534][__main__][INFO] - Iteration 146 took 22s (36.11% Gen, 59.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 38m 34s. Estimated total time: 18h 30m 17s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 0s, 500 more iterations: 3h 5m 2s. [2025-11-13 08:55:32,536][__main__][INFO] - Starting iteration 146. [2025-11-13 08:55:32,539][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:55:32,540][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:55:40,227][__main__][INFO] - Number of regex retries in iteration 146: 0 [2025-11-13 08:55:40,227][__main__][INFO] - agents played in iteration 146 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:55:40,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:40,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:40,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:40,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:40,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:55:40,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:55:41,564][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:55:41,859][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:55:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:55:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:55:42,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:55:43,156][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:55:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:55:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:55:44,128][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:55:44,452][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:55:44,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:55:45,101][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:55:45,425][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:55:45,750][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:55:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:55:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:55:46,722][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:55:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:55:47,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:55:47,696][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:55:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:55:48,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:55:48,669][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:55:48,993][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:55:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:55:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:55:49,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:55:50,294][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:55:50,619][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:55:50,943][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:55:51,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:55:51,594][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:55:51,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:55:52,695][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:55:53,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:55:53,449][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:55:53,450][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:55:54,420][__main__][INFO] - Iteration 147 took 21s (35.13% Gen, 60.43% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 21m 59s. Estimated total time: 18h 14m 4s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 28s, 500 more iterations: 3h 2m 20s. [2025-11-13 08:55:54,422][__main__][INFO] - Starting iteration 147. [2025-11-13 08:55:54,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:55:54,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:56:02,480][__main__][INFO] - Number of regex retries in iteration 147: 0 [2025-11-13 08:56:02,481][__main__][INFO] - agents played in iteration 147 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:56:02,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:02,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:03,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:03,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:03,056][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:56:03,056][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:56:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:56:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:56:04,457][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:56:04,792][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:56:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:56:05,442][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:56:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:56:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:56:06,423][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:56:06,746][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:56:07,071][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:56:07,395][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:56:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:56:08,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:56:08,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:56:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:56:09,022][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:56:09,347][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:56:09,671][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:56:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:56:10,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:56:10,649][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:56:10,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:56:11,298][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:56:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:56:11,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:56:12,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:56:12,600][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:56:12,925][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:56:13,250][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:56:13,576][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:56:13,901][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:56:14,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:56:15,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:56:15,780][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:56:15,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:56:15,784][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:56:16,753][__main__][INFO] - Iteration 148 took 22s (36.07% Gen, 59.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 43m 59s. Estimated total time: 18h 36m 27s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 12s, 500 more iterations: 3h 6m 4s. [2025-11-13 08:56:16,755][__main__][INFO] - Starting iteration 148. [2025-11-13 08:56:16,758][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:56:16,758][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:56:24,566][__main__][INFO] - Number of regex retries in iteration 148: 0 [2025-11-13 08:56:24,567][__main__][INFO] - agents played in iteration 148 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:56:25,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:25,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:25,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:25,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:25,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:56:25,139][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:56:25,909][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:56:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:56:26,530][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:56:26,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:56:27,176][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:56:27,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:56:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:56:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:56:28,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:56:28,800][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:56:29,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:56:29,448][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:56:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:56:30,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:56:30,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:56:30,743][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:56:31,068][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:56:31,391][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:56:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:56:32,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:56:32,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:56:32,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:56:33,011][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:56:33,336][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:56:33,661][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:56:33,985][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:56:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:56:34,634][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:56:34,959][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:56:35,284][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:56:35,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:56:35,935][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:56:36,259][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:56:37,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:56:37,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:56:37,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:56:37,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:56:38,696][__main__][INFO] - Iteration 149 took 21s (35.59% Gen, 60.14% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 7s. Estimated total time: 18h 16m 57s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 33s, 500 more iterations: 3h 2m 49s. [2025-11-13 08:56:38,698][__main__][INFO] - Starting iteration 149. [2025-11-13 08:56:38,701][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:56:38,701][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:56:46,576][__main__][INFO] - Number of regex retries in iteration 149: 0 [2025-11-13 08:56:46,577][__main__][INFO] - agents played in iteration 149 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:56:47,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:47,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:47,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:47,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:47,152][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:56:47,152][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:56:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:56:48,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:56:48,541][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:56:48,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:56:49,196][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:56:49,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:56:49,845][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:56:50,169][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:56:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:56:50,817][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:56:51,142][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:56:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:56:51,796][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:56:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:56:52,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:56:52,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:56:53,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:56:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:56:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:56:54,067][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:56:54,394][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:56:54,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:56:55,042][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:56:55,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:56:55,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:56:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:56:56,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:56:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:56:56,995][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:56:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:56:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:56:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:56:58,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:56:59,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:56:59,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:56:59,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:56:59,811][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:57:00,713][__main__][INFO] - Iteration 150 took 22s (35.78% Gen, 60.12% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 28s. Estimated total time: 18h 20m 40s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 41s, 500 more iterations: 3h 3m 26s. [2025-11-13 08:57:00,716][__main__][INFO] - Starting iteration 150. [2025-11-13 08:57:00,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. [2025-11-13 08:57:00,721][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:57:08,672][__main__][INFO] - Number of regex retries in iteration 150: 0 [2025-11-13 08:57:08,673][__main__][INFO] - agents played in iteration 150 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:57:09,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:09,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:09,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:09,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:09,254][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:57:09,254][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:57:10,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:57:10,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:57:10,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:57:10,965][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:57:11,289][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:57:11,620][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:57:11,936][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:57:12,261][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:57:12,588][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:57:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:57:13,237][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:57:13,561][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:57:13,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:57:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:57:14,536][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:57:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:57:15,183][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:57:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:57:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:57:16,162][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:57:16,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:57:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:57:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:57:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:57:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:57:18,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:57:18,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:57:18,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:57:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:57:19,420][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:57:19,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:57:20,071][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:57:20,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:57:21,191][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:57:21,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:57:21,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:57:21,959][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:57:23,825][__main__][INFO] - Iteration 151 took 23s (34.41% Gen, 57.50% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 21m 42s. Estimated total time: 19h 15m 17s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 32s. [2025-11-13 08:57:23,827][__main__][INFO] - Starting iteration 151. [2025-11-13 08:57:23,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:57:23,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:57:32,072][__main__][INFO] - Number of regex retries in iteration 151: 0 [2025-11-13 08:57:32,072][__main__][INFO] - agents played in iteration 151 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:57:32,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:32,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:32,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:32,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:32,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:57:32,668][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:57:33,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:57:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:57:34,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:57:34,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:57:34,710][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:57:35,035][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:57:35,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:57:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:57:36,014][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:57:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:57:36,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:57:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:57:37,315][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:57:37,639][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:57:37,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:57:38,290][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:57:38,613][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:57:38,939][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:57:39,267][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:57:39,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:57:39,917][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:57:40,242][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:57:40,566][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:57:40,891][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:57:41,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:57:41,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:57:41,875][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:57:42,202][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:57:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:57:42,856][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:57:43,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:57:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:57:43,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:57:44,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:57:45,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:57:45,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:57:45,339][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:57:46,293][__main__][INFO] - Iteration 152 took 22s (36.68% Gen, 59.06% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 49m 14s. Estimated total time: 18h 43m 11s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 26s, 500 more iterations: 3h 7m 11s. [2025-11-13 08:57:46,296][__main__][INFO] - Starting iteration 152. [2025-11-13 08:57:46,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:57:46,299][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:57:54,019][__main__][INFO] - Number of regex retries in iteration 152: 0 [2025-11-13 08:57:54,020][__main__][INFO] - agents played in iteration 152 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:57:54,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:54,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:54,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:54,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:54,590][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:57:54,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:57:55,362][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:57:55,658][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:57:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:57:56,312][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:57:56,641][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:57:56,967][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:57:57,292][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:57:57,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:57:57,939][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:57:58,267][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:57:58,597][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:57:58,921][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:57:59,248][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:57:59,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:57:59,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:58:00,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:58:00,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:58:00,871][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:58:01,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:58:01,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:58:01,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:58:02,173][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:58:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:58:02,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:58:03,154][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:58:03,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:58:03,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:58:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:58:04,455][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:58:04,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:58:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:58:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:58:05,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:58:06,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:58:07,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:58:07,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:58:07,226][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:58:08,225][__main__][INFO] - Iteration 153 took 21s (35.21% Gen, 60.22% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 4s. Estimated total time: 18h 16m 23s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 32s, 500 more iterations: 3h 2m 43s. [2025-11-13 08:58:08,232][__main__][INFO] - Starting iteration 153. [2025-11-13 08:58:08,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:58:08,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:58:16,069][__main__][INFO] - Number of regex retries in iteration 153: 0 [2025-11-13 08:58:16,069][__main__][INFO] - agents played in iteration 153 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:58:16,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:16,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:16,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:16,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:16,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:58:16,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:58:17,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:58:17,712][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:58:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:58:18,363][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:58:18,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:58:19,016][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:58:19,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:58:19,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:58:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:58:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:58:20,644][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:58:20,970][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:58:21,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:58:21,617][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:58:21,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:58:22,264][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:58:22,588][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:58:22,912][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:58:23,237][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:58:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:58:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:58:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:58:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:58:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:58:25,199][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:58:25,525][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:58:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:58:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:58:26,499][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:58:26,823][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:58:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:58:27,476][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:58:27,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:58:28,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:58:29,339][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:58:29,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:58:29,342][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:58:30,336][__main__][INFO] - Iteration 154 took 22s (35.44% Gen, 60.06% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 30m 21s. Estimated total time: 18h 25m 2s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 50s, 500 more iterations: 3h 4m 10s. [2025-11-13 08:58:30,338][__main__][INFO] - Starting iteration 154. [2025-11-13 08:58:30,341][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:58:30,341][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:58:38,106][__main__][INFO] - Number of regex retries in iteration 154: 0 [2025-11-13 08:58:38,106][__main__][INFO] - agents played in iteration 154 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:58:38,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:38,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:38,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:38,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:38,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:58:38,686][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:58:39,450][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:58:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:58:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:58:40,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:58:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:58:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:58:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:58:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:58:42,019][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:58:42,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:58:42,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:58:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:58:43,320][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:58:43,646][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:58:43,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:58:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:58:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:58:44,947][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:58:45,272][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:58:45,596][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:58:45,923][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:58:46,253][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:58:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:58:46,906][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:58:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:58:47,561][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:58:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:58:48,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:58:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:58:48,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:58:49,187][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:58:49,513][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:58:49,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:58:50,596][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:58:51,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:58:51,349][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:58:51,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:58:52,356][__main__][INFO] - Iteration 155 took 22s (35.27% Gen, 60.16% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 25m 45s. Estimated total time: 18h 20m 48s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 41s, 500 more iterations: 3h 3m 28s. [2025-11-13 08:58:52,359][__main__][INFO] - Starting iteration 155. [2025-11-13 08:58:52,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:58:52,362][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:58:59,707][__main__][INFO] - Number of regex retries in iteration 155: 0 [2025-11-13 08:58:59,708][__main__][INFO] - agents played in iteration 155 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:59:00,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:00,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:00,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:00,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:00,303][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:59:00,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:59:01,057][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:59:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:59:01,677][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:59:02,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:59:02,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:59:02,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:59:02,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:59:03,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:59:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:59:03,945][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:59:04,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:59:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:59:04,917][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:59:05,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:59:05,567][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:59:05,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:59:06,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:59:06,544][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:59:06,869][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:59:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:59:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:59:07,847][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:59:08,172][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:59:08,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:59:08,822][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:59:09,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:59:09,472][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:59:09,797][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:59:10,123][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:59:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:59:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:59:11,097][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:59:11,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:59:12,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:59:12,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:59:12,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:59:12,897][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:59:13,883][__main__][INFO] - Iteration 156 took 21s (34.13% Gen, 61.28% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 0m 42s. Estimated total time: 17h 56m 7s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 52s, 500 more iterations: 2h 59m 21s. [2025-11-13 08:59:13,885][__main__][INFO] - Starting iteration 156. [2025-11-13 08:59:13,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:59:13,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:59:21,749][__main__][INFO] - Number of regex retries in iteration 156: 0 [2025-11-13 08:59:21,750][__main__][INFO] - agents played in iteration 156 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:59:22,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:22,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:22,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:22,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:22,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:59:22,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:59:23,099][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:59:23,397][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:59:23,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:59:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:59:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:59:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:59:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:59:25,352][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:59:25,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:59:26,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:59:26,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:59:26,652][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:59:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:59:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:59:27,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:59:27,956][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:59:28,280][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:59:28,605][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:59:28,931][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:59:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:59:29,584][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:59:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:59:30,233][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:59:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:59:30,889][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:59:31,214][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:59:31,539][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:59:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:59:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:59:32,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:59:32,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:59:33,165][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:59:33,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:59:34,199][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:59:34,948][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:59:34,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:59:34,951][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:59:35,950][__main__][INFO] - Iteration 157 took 22s (35.63% Gen, 59.83% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 19s. Estimated total time: 18h 23m 6s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 46s, 500 more iterations: 3h 3m 51s. [2025-11-13 08:59:35,952][__main__][INFO] - Starting iteration 157. [2025-11-13 08:59:35,956][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:59:35,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:59:43,847][__main__][INFO] - Number of regex retries in iteration 157: 0 [2025-11-13 08:59:43,847][__main__][INFO] - agents played in iteration 157 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 08:59:44,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:44,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:44,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:44,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:44,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:59:44,418][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 08:59:45,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:59:45,485][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:59:45,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:59:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:59:46,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:59:46,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:59:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:59:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:59:47,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:59:48,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:59:48,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:59:48,749][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:59:49,074][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:59:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:59:49,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:59:50,051][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:59:50,378][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:59:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:59:51,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:59:51,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:59:51,683][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:59:52,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:59:52,332][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:59:52,658][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:59:52,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:59:53,312][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:59:53,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:59:53,962][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:59:54,287][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:59:54,613][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:59:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:59:55,262][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:59:55,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 08:59:56,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:59:57,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:59:57,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:59:57,094][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:59:58,172][__main__][INFO] - Iteration 158 took 22s (35.52% Gen, 59.62% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 34m 42s. Estimated total time: 18h 30m 52s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 1s, 500 more iterations: 3h 5m 8s. [2025-11-13 08:59:58,174][__main__][INFO] - Starting iteration 158. [2025-11-13 08:59:58,177][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:59:58,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:00:06,045][__main__][INFO] - Number of regex retries in iteration 158: 0 [2025-11-13 09:00:06,046][__main__][INFO] - agents played in iteration 158 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:00:06,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:06,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:06,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:06,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:06,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:00:06,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:00:07,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:00:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:00:08,028][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:00:08,353][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:00:08,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:00:09,017][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:00:09,343][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:00:09,669][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:00:10,010][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:00:10,336][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:00:10,661][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:00:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:00:11,320][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:00:11,645][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:00:11,974][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:00:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:00:12,625][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:00:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:00:13,277][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:00:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:00:13,934][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:00:14,261][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:00:14,588][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:00:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:00:15,238][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:00:15,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:00:15,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:00:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:00:16,551][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:00:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:00:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:00:17,528][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:00:17,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:00:18,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:00:19,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:00:19,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:00:19,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:00:20,317][__main__][INFO] - Iteration 159 took 22s (35.54% Gen, 59.97% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 30m 29s. Estimated total time: 18h 27m 0s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 54s, 500 more iterations: 3h 4m 30s. [2025-11-13 09:00:20,320][__main__][INFO] - Starting iteration 159. [2025-11-13 09:00:20,324][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 09:00:20,325][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:00:27,857][__main__][INFO] - Number of regex retries in iteration 159: 0 [2025-11-13 09:00:27,858][__main__][INFO] - agents played in iteration 159 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:00:28,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:28,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:28,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:28,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:28,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:00:28,433][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:00:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:00:29,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:00:29,834][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:00:30,158][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:00:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:00:30,808][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:00:31,133][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:00:31,459][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:00:31,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:00:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:00:32,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:00:32,761][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:00:33,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:00:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:00:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:00:34,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:00:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:00:34,711][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:00:35,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:00:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:00:35,684][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:00:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:00:36,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:00:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:00:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:00:37,306][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:00:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:00:37,959][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:00:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:00:38,614][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:00:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:00:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:00:39,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:00:40,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:00:41,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:00:41,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:00:41,069][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:00:42,068][__main__][INFO] - Iteration 160 took 21s (34.64% Gen, 60.76% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 10m 21s. Estimated total time: 18h 7m 14s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 14s, 500 more iterations: 3h 1m 12s. [2025-11-13 09:00:42,070][__main__][INFO] - Starting iteration 160. [2025-11-13 09:00:42,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 09:00:42,074][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:00:49,590][__main__][INFO] - Number of regex retries in iteration 160: 0 [2025-11-13 09:00:49,591][__main__][INFO] - agents played in iteration 160 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:00:50,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:50,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:50,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:50,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:50,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:00:50,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:00:50,960][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:00:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:00:51,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:00:51,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:00:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:00:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:00:52,893][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:00:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:00:53,544][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:00:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:00:54,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:00:54,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:00:54,850][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:00:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:00:55,499][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:00:55,829][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:00:56,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:00:56,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:00:56,801][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:00:57,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:00:57,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:00:57,773][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:00:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:00:58,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:00:58,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:00:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:00:59,394][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:00:59,717][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:01:00,044][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:01:00,369][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:01:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:01:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:01:01,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:01:02,055][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:01:02,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:01:02,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:01:02,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:01:04,723][__main__][INFO] - Iteration 161 took 22s (33.18% Gen, 58.33% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 55m 15s. Estimated total time: 18h 52m 30s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 45s. [2025-11-13 09:01:04,725][__main__][INFO] - Starting iteration 161. [2025-11-13 09:01:04,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:01:04,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:01:12,945][__main__][INFO] - Number of regex retries in iteration 161: 0 [2025-11-13 09:01:12,946][__main__][INFO] - agents played in iteration 161 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:01:13,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:13,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:13,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:13,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:13,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:01:13,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:01:14,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:01:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:01:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:01:15,244][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:01:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:01:15,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:01:16,218][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:01:16,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:01:16,866][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:01:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:01:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:01:17,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:01:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:01:18,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:01:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:01:19,154][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:01:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:01:19,803][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:01:20,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:01:20,464][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:01:20,793][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:01:21,118][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:01:21,447][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:01:21,772][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:01:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:01:22,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:01:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:01:23,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:01:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:01:23,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:01:24,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:01:24,371][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:01:24,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:01:25,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:01:26,167][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:01:26,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:01:26,170][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:01:27,184][__main__][INFO] - Iteration 162 took 22s (36.59% Gen, 58.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 45m 13s. Estimated total time: 18h 42m 51s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 25s, 500 more iterations: 3h 7m 8s. [2025-11-13 09:01:27,186][__main__][INFO] - Starting iteration 162. [2025-11-13 09:01:27,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:01:27,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:01:35,310][__main__][INFO] - Number of regex retries in iteration 162: 0 [2025-11-13 09:01:35,310][__main__][INFO] - agents played in iteration 162 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:01:35,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:35,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:35,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:35,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:35,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:01:35,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:01:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:01:36,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:01:37,277][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:01:37,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:01:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:01:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:01:38,581][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:01:38,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:01:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:01:39,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:01:39,886][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:01:40,208][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:01:40,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:01:40,865][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:01:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:01:41,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:01:41,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:01:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:01:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:01:42,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:01:43,163][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:01:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:01:43,812][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:01:44,147][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:01:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:01:44,793][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:01:45,118][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:01:45,453][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:01:45,776][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:01:46,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:01:46,424][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:01:46,758][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:01:47,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:01:47,803][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:01:48,533][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:01:48,534][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:01:48,536][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:01:49,542][__main__][INFO] - Iteration 163 took 22s (36.32% Gen, 59.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 39m 39s. Estimated total time: 18h 37m 40s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 15s, 500 more iterations: 3h 6m 16s. [2025-11-13 09:01:49,546][__main__][INFO] - Starting iteration 163. [2025-11-13 09:01:49,549][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:01:49,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:01:57,463][__main__][INFO] - Number of regex retries in iteration 163: 0 [2025-11-13 09:01:57,463][__main__][INFO] - agents played in iteration 163 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:01:57,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:57,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:58,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:58,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:01:58,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:01:58,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:01:58,810][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:01:59,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:01:59,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:01:59,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:02:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:02:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:02:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:02:01,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:02:01,388][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:02:01,717][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:02:02,040][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:02:02,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:02:02,696][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:02:03,021][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:02:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:02:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:02:03,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:02:04,326][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:02:04,652][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:02:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:02:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:02:05,628][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:02:05,953][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:02:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:02:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:02:06,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:02:07,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:02:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:02:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:02:08,221][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:02:08,549][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:02:08,873][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:02:09,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:02:09,929][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:02:10,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:02:10,665][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:02:10,666][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:02:11,640][__main__][INFO] - Iteration 164 took 22s (35.82% Gen, 59.76% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 26m 13s. Estimated total time: 18h 24m 36s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 49s, 500 more iterations: 3h 4m 6s. [2025-11-13 09:02:11,643][__main__][INFO] - Starting iteration 164. [2025-11-13 09:02:11,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:02:11,647][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:02:19,197][__main__][INFO] - Number of regex retries in iteration 164: 0 [2025-11-13 09:02:19,197][__main__][INFO] - agents played in iteration 164 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:02:19,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:19,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:19,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:19,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:19,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:02:19,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:02:20,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:02:20,864][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:02:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:02:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:02:21,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:02:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:02:22,493][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:02:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:02:23,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:02:23,465][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:02:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:02:24,114][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:02:24,438][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:02:24,767][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:02:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:02:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:02:25,751][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:02:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:02:26,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:02:26,723][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:02:27,049][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:02:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:02:27,698][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:02:28,022][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:02:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:02:28,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:02:28,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:02:29,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:02:29,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:02:29,967][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:02:30,290][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:02:30,615][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:02:30,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:02:31,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:02:32,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:02:32,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:02:32,444][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:02:33,427][__main__][INFO] - Iteration 165 took 21s (34.66% Gen, 60.81% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 10m 21s. Estimated total time: 18h 9m 5s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 18s, 500 more iterations: 3h 1m 30s. [2025-11-13 09:02:33,429][__main__][INFO] - Starting iteration 165. [2025-11-13 09:02:33,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:02:33,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:02:41,535][__main__][INFO] - Number of regex retries in iteration 165: 0 [2025-11-13 09:02:41,536][__main__][INFO] - agents played in iteration 165 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:02:42,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:42,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:42,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:42,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:42,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:02:42,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:02:42,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:02:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:02:43,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:02:43,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:02:44,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:02:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:02:44,805][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:02:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:02:45,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:02:45,778][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:02:46,103][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:02:46,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:02:46,754][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:02:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:02:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:02:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:02:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:02:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:02:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:02:49,026][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:02:49,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:02:49,674][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:02:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:02:50,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:02:50,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:02:50,975][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:02:51,299][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:02:51,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:02:51,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:02:52,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:02:52,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:02:52,929][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:02:53,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:02:54,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:02:54,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:02:54,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:02:54,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:02:55,770][__main__][INFO] - Iteration 166 took 22s (36.27% Gen, 59.21% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 45s. Estimated total time: 18h 36m 52s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 13s, 500 more iterations: 3h 6m 8s. [2025-11-13 09:02:55,772][__main__][INFO] - Starting iteration 166. [2025-11-13 09:02:55,776][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:02:55,777][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:03:03,516][__main__][INFO] - Number of regex retries in iteration 166: 0 [2025-11-13 09:03:03,517][__main__][INFO] - agents played in iteration 166 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:03:03,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:04,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:04,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:04,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:04,083][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:03:04,084][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:03:04,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:03:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:03:05,478][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:03:05,804][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:03:06,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:03:06,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:03:06,779][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:03:07,103][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:03:07,428][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:03:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:03:08,076][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:03:08,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:03:08,728][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:03:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:03:09,374][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:03:09,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:03:10,023][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:03:10,347][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:03:10,670][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:03:11,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:03:11,320][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:03:11,646][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:03:11,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:03:12,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:03:12,617][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:03:12,940][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:03:13,265][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:03:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:03:13,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:03:14,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:03:14,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:03:14,898][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:03:15,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:03:15,936][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:03:16,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:03:16,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:03:16,682][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:03:17,686][__main__][INFO] - Iteration 167 took 21s (35.32% Gen, 60.08% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 6s. Estimated total time: 18h 15m 34s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 31s, 500 more iterations: 3h 2m 35s. [2025-11-13 09:03:17,688][__main__][INFO] - Starting iteration 167. [2025-11-13 09:03:17,691][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:03:17,692][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:03:25,433][__main__][INFO] - Number of regex retries in iteration 167: 0 [2025-11-13 09:03:25,434][__main__][INFO] - agents played in iteration 167 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:03:25,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:25,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:25,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:26,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:26,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:03:26,008][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:03:26,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:03:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:03:27,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:03:27,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:03:28,049][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:03:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:03:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:03:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:03:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:03:29,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:03:30,009][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:03:30,338][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:03:30,663][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:03:30,986][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:03:31,311][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:03:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:03:31,965][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:03:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:03:32,624][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:03:32,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:03:33,280][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:03:33,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:03:33,932][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:03:34,257][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:03:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:03:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:03:35,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:03:35,563][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:03:35,894][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:03:36,216][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:03:36,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:03:36,868][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:03:37,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:03:37,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:03:38,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:03:38,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:03:38,661][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:03:39,701][__main__][INFO] - Iteration 168 took 22s (35.17% Gen, 60.09% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 41s. Estimated total time: 18h 20m 32s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 41s, 500 more iterations: 3h 3m 25s. [2025-11-13 09:03:39,703][__main__][INFO] - Starting iteration 168. [2025-11-13 09:03:39,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:03:39,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:03:47,605][__main__][INFO] - Number of regex retries in iteration 168: 0 [2025-11-13 09:03:47,606][__main__][INFO] - agents played in iteration 168 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:03:48,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:48,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:48,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:48,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:48,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:03:48,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:03:48,955][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:03:49,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:03:49,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:03:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:03:50,228][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:03:50,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:03:50,878][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:03:51,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:03:51,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:03:51,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:03:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:03:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:03:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:03:53,163][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:03:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:03:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:03:54,135][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:03:54,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:03:54,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:03:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:03:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:03:55,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:03:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:03:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:03:56,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:03:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:03:57,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:03:57,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:03:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:03:58,355][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:03:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:03:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:03:59,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:04:00,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:04:00,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:04:00,854][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:04:00,856][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:04:01,859][__main__][INFO] - Iteration 169 took 22s (35.66% Gen, 59.81% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 26s. Estimated total time: 18h 27m 39s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 55s, 500 more iterations: 3h 4m 36s. [2025-11-13 09:04:01,861][__main__][INFO] - Starting iteration 169. [2025-11-13 09:04:01,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:04:01,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:04:06,468][mllm.models.large_language_model_local][WARNING] - Response A> did not match regex: (|), retry 1/1 [2025-11-13 09:04:09,813][__main__][INFO] - Number of regex retries in iteration 169: 1 [2025-11-13 09:04:09,814][__main__][INFO] - agents played in iteration 169 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:04:10,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:10,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:10,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:10,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:10,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:04:10,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:04:11,149][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:04:11,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:04:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:04:12,095][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:04:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:04:12,750][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:04:13,086][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:04:13,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:04:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:04:14,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:04:14,396][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:04:14,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:04:15,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:04:15,369][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:04:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:04:16,025][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:04:16,350][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:04:16,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:04:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:04:17,332][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:04:17,658][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:04:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:04:18,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:04:18,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:04:18,962][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:04:19,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:04:19,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:04:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:04:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:04:20,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:04:20,920][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:04:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:04:21,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:04:22,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:04:23,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:04:23,070][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:04:23,071][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:04:24,075][__main__][INFO] - Iteration 170 took 22s (35.79% Gen, 59.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 30m 0s. Estimated total time: 18h 30m 35s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 1s, 500 more iterations: 3h 5m 5s. [2025-11-13 09:04:24,077][__main__][INFO] - Starting iteration 170. [2025-11-13 09:04:24,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 09:04:24,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:04:32,178][__main__][INFO] - Number of regex retries in iteration 170: 0 [2025-11-13 09:04:32,179][__main__][INFO] - agents played in iteration 170 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:04:32,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:32,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:32,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:32,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:32,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:04:32,769][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:04:33,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:04:33,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:04:34,144][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:04:34,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:04:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:04:35,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:04:35,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:04:35,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:04:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:04:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:04:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:04:37,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:04:37,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:04:37,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:04:38,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:04:38,369][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:04:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:04:39,022][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:04:39,341][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:04:39,665][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:04:39,989][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:04:40,320][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:04:40,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:04:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:04:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:04:41,615][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:04:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:04:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:04:42,585][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:04:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:04:43,232][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:04:43,556][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:04:43,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:04:44,645][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:04:45,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:04:45,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:04:45,396][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:04:47,376][__main__][INFO] - Iteration 171 took 23s (34.76% Gen, 56.73% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 53s. Estimated total time: 19h 24m 52s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 8s. [2025-11-13 09:04:47,379][__main__][INFO] - Starting iteration 171. [2025-11-13 09:04:47,382][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:04:47,382][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:04:55,771][__main__][INFO] - Number of regex retries in iteration 171: 0 [2025-11-13 09:04:55,771][__main__][INFO] - agents played in iteration 171 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:04:56,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:56,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:56,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:56,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:56,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:04:56,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:04:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:04:57,374][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:04:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:04:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:04:58,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:04:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:04:59,013][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:04:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:04:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:04:59,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:05:00,319][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:05:00,642][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:05:00,967][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:05:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:05:01,620][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:05:01,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:05:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:05:02,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:05:02,917][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:05:03,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:05:03,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:05:03,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:05:04,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:05:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:05:04,865][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:05:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:05:05,519][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:05:05,843][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:05:06,167][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:05:06,490][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:05:06,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:05:07,142][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:05:07,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:05:08,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:05:08,932][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:05:08,933][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:05:08,935][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:05:09,928][__main__][INFO] - Iteration 172 took 22s (37.20% Gen, 58.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 46m 1s. Estimated total time: 18h 47m 22s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 34s, 500 more iterations: 3h 7m 53s. [2025-11-13 09:05:09,930][__main__][INFO] - Starting iteration 172. [2025-11-13 09:05:09,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:05:09,935][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:05:18,456][__main__][INFO] - Number of regex retries in iteration 172: 0 [2025-11-13 09:05:18,457][__main__][INFO] - agents played in iteration 172 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:05:18,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:18,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:18,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:19,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:19,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:05:19,027][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:05:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:05:20,111][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:05:20,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:05:20,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:05:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:05:21,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:05:21,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:05:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:05:22,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:05:22,716][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:05:23,038][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:05:23,362][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:05:23,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:05:24,016][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:05:24,339][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:05:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:05:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:05:25,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:05:25,639][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:05:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:05:26,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:05:26,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:05:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:05:27,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:05:27,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:05:27,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:05:28,237][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:05:28,565][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:05:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:05:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:05:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:05:29,859][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:05:30,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:05:30,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:05:31,662][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:05:31,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:05:31,665][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:05:32,654][__main__][INFO] - Iteration 173 took 22s (37.51% Gen, 58.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 54m 17s. Estimated total time: 18h 56m 0s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 52s, 500 more iterations: 3h 9m 20s. [2025-11-13 09:05:32,656][__main__][INFO] - Starting iteration 173. [2025-11-13 09:05:32,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:05:32,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:05:41,217][__main__][INFO] - Number of regex retries in iteration 173: 0 [2025-11-13 09:05:41,217][__main__][INFO] - agents played in iteration 173 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:05:41,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:41,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:41,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:41,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:41,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:05:41,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:05:42,548][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:05:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:05:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:05:43,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:05:43,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:05:44,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:05:44,468][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:05:44,791][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:05:45,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:05:45,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:05:45,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:05:46,086][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:05:46,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:05:46,739][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:05:47,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:05:47,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:05:47,709][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:05:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:05:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:05:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:05:49,009][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:05:49,334][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:05:49,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:05:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:05:50,305][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:05:50,629][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:05:50,954][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:05:51,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:05:51,609][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:05:51,934][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:05:52,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:05:52,582][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:05:52,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:05:53,625][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:05:54,367][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:05:54,369][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:05:54,370][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:05:55,364][__main__][INFO] - Iteration 174 took 22s (37.69% Gen, 57.93% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 53m 11s. Estimated total time: 18h 55m 17s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 12s. [2025-11-13 09:05:55,366][__main__][INFO] - Starting iteration 174. [2025-11-13 09:05:55,370][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:05:55,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:06:03,858][__main__][INFO] - Number of regex retries in iteration 174: 0 [2025-11-13 09:06:03,859][__main__][INFO] - agents played in iteration 174 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:06:04,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:04,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:04,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:04,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:04,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:06:04,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:06:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:06:05,499][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:06:05,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:06:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:06:06,473][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:06:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:06:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:06:07,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:06:07,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:06:08,096][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:06:08,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:06:08,744][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:06:09,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:06:09,393][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:06:09,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:06:10,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:06:10,365][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:06:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:06:11,019][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:06:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:06:11,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:06:11,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:06:12,312][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:06:12,633][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:06:12,956][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:06:13,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:06:13,604][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:06:13,929][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:06:14,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:06:14,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:06:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:06:15,227][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:06:15,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:06:16,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:06:17,026][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:06:17,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:06:17,030][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:06:18,032][__main__][INFO] - Iteration 175 took 22s (37.45% Gen, 58.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 50m 40s. Estimated total time: 18h 53m 9s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 51s. [2025-11-13 09:06:18,034][__main__][INFO] - Starting iteration 175. [2025-11-13 09:06:18,038][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:06:18,038][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:06:26,881][__main__][INFO] - Number of regex retries in iteration 175: 0 [2025-11-13 09:06:26,881][__main__][INFO] - agents played in iteration 175 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:06:27,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:27,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:27,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:27,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:27,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:06:27,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:06:28,263][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:06:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:06:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:06:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:06:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:06:29,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:06:30,184][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:06:30,509][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:06:30,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:06:31,158][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:06:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:06:31,807][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:06:32,133][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:06:32,458][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:06:32,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:06:33,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:06:33,432][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:06:33,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:06:34,086][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:06:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:06:34,736][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:06:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:06:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:06:35,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:06:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:06:36,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:06:36,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:06:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:06:37,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:06:37,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:06:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:06:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:06:38,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:06:39,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:06:40,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:06:40,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:06:40,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:06:41,169][__main__][INFO] - Iteration 176 took 23s (38.23% Gen, 57.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 13m 43s. Estimated total time: 19h 16m 35s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 45s. [2025-11-13 09:06:41,171][__main__][INFO] - Starting iteration 176. [2025-11-13 09:06:41,174][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:06:41,174][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:06:49,656][__main__][INFO] - Number of regex retries in iteration 176: 0 [2025-11-13 09:06:49,657][__main__][INFO] - agents played in iteration 176 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:06:50,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:50,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:50,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:50,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:50,283][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:06:50,283][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:06:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:06:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:06:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:06:51,995][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:06:52,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:06:52,641][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:06:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:06:53,289][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:06:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:06:53,938][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:06:54,266][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:06:54,584][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:06:54,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:06:55,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:06:55,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:06:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:06:56,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:06:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:06:56,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:06:57,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:06:57,505][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:06:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:06:58,160][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:06:58,476][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:06:58,800][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:06:59,123][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:06:59,447][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:06:59,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:07:00,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:07:00,418][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:07:00,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:07:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:07:01,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:07:02,103][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:07:02,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:07:02,875][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:07:02,877][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:07:03,873][__main__][INFO] - Iteration 177 took 22s (37.37% Gen, 58.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 51m 46s. Estimated total time: 18h 55m 1s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 10s. [2025-11-13 09:07:03,875][__main__][INFO] - Starting iteration 177. [2025-11-13 09:07:03,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:07:03,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:07:12,803][__main__][INFO] - Number of regex retries in iteration 177: 0 [2025-11-13 09:07:12,804][__main__][INFO] - agents played in iteration 177 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:07:13,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:13,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:13,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:13,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:13,384][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:07:13,384][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:07:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:07:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:07:14,772][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:07:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:07:15,430][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:07:15,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:07:16,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:07:16,408][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:07:16,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:07:17,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:07:17,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:07:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:07:18,041][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:07:18,366][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:07:18,691][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:07:19,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:07:19,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:07:19,662][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:07:19,986][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:07:20,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:07:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:07:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:07:21,282][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:07:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:07:21,929][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:07:22,256][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:07:22,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:07:22,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:07:23,230][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:07:23,557][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:07:23,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:07:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:07:24,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:07:25,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:07:26,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:07:26,009][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:07:26,010][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:07:27,033][__main__][INFO] - Iteration 178 took 23s (38.54% Gen, 57.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 14m 6s. Estimated total time: 19h 17m 44s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 57s. [2025-11-13 09:07:27,035][__main__][INFO] - Starting iteration 178. [2025-11-13 09:07:27,038][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:07:27,039][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:07:35,951][__main__][INFO] - Number of regex retries in iteration 178: 0 [2025-11-13 09:07:35,952][__main__][INFO] - agents played in iteration 178 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:07:36,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:36,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:36,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:36,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:36,540][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:07:36,541][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:07:37,319][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:07:37,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:07:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:07:38,275][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:07:38,599][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:07:38,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:07:39,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:07:39,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:07:39,908][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:07:40,232][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:07:40,562][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:07:40,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:07:41,210][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:07:41,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:07:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:07:42,181][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:07:42,506][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:07:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:07:43,157][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:07:43,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:07:43,807][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:07:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:07:44,456][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:07:44,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:07:45,103][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:07:45,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:07:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:07:46,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:07:46,401][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:07:46,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:07:47,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:07:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:07:47,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:07:48,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:07:49,228][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:07:49,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:07:49,231][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:07:50,243][__main__][INFO] - Iteration 179 took 23s (38.41% Gen, 57.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 16s. Estimated total time: 19h 20m 18s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 23s. [2025-11-13 09:07:50,245][__main__][INFO] - Starting iteration 179. [2025-11-13 09:07:50,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:07:50,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:07:58,722][__main__][INFO] - Number of regex retries in iteration 179: 0 [2025-11-13 09:07:58,722][__main__][INFO] - agents played in iteration 179 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:07:59,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,303][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:07:59,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:08:00,085][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:08:00,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:08:00,705][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:08:01,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:08:01,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:08:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:08:02,004][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:08:02,334][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:08:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:08:02,981][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:08:03,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:08:03,633][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:08:03,957][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:08:04,280][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:08:04,603][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:08:04,928][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:08:05,252][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:08:05,576][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:08:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:08:06,223][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:08:06,548][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:08:06,872][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:08:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:08:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:08:07,842][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:08:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:08:08,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:08:08,815][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:08:09,140][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:08:09,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:08:09,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:08:10,114][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:08:10,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:08:11,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:08:11,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:08:11,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:08:11,923][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:08:12,924][__main__][INFO] - Iteration 180 took 22s (37.36% Gen, 58.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 49m 26s. Estimated total time: 18h 53m 50s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 47s, 500 more iterations: 3h 8m 58s. [2025-11-13 09:08:12,926][__main__][INFO] - Starting iteration 180. [2025-11-13 09:08:12,930][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. [2025-11-13 09:08:12,931][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:08:21,192][__main__][INFO] - Number of regex retries in iteration 180: 0 [2025-11-13 09:08:21,193][__main__][INFO] - agents played in iteration 180 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:08:21,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:21,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:21,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:21,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:21,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:08:21,774][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:08:22,543][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:08:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:08:23,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:08:23,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:08:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:08:24,149][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:08:24,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:08:24,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:08:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:08:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:08:25,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:08:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:08:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:08:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:08:27,078][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:08:27,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:08:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:08:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:08:28,366][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:08:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:08:29,017][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:08:29,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:08:29,669][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:08:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:08:30,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:08:30,642][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:08:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:08:31,291][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:08:31,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:08:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:08:32,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:08:32,587][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:08:32,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:08:33,669][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:08:34,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:08:34,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:08:34,421][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:08:36,457][__main__][INFO] - Iteration 181 took 23s (35.12% Gen, 56.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 31m 36s. Estimated total time: 19h 36m 23s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 3s. [2025-11-13 09:08:36,459][__main__][INFO] - Starting iteration 181. [2025-11-13 09:08:36,463][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:08:36,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:08:45,358][__main__][INFO] - Number of regex retries in iteration 181: 0 [2025-11-13 09:08:45,358][__main__][INFO] - agents played in iteration 181 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:08:45,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:45,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:45,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:45,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:45,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:08:45,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:08:46,695][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:08:46,990][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:08:47,317][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:08:47,644][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:08:47,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:08:48,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:08:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:08:48,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:08:49,265][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:08:49,591][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:08:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:08:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:08:50,567][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:08:50,887][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:08:51,213][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:08:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:08:51,869][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:08:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:08:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:08:52,837][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:08:53,164][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:08:53,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:08:53,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:08:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:08:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:08:54,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:08:55,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:08:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:08:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:08:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:08:56,402][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:08:56,725][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:08:57,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:08:57,784][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:08:58,551][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:08:58,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:08:58,554][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:08:59,603][__main__][INFO] - Iteration 182 took 23s (38.31% Gen, 57.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 11m 53s. Estimated total time: 19h 17m 3s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 50s. [2025-11-13 09:08:59,605][__main__][INFO] - Starting iteration 182. [2025-11-13 09:08:59,609][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:08:59,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:09:08,055][__main__][INFO] - Number of regex retries in iteration 182: 0 [2025-11-13 09:09:08,056][__main__][INFO] - agents played in iteration 182 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:09:08,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:08,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:08,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:08,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:08,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:09:08,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:09:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:09:09,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:09:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:09:10,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:09:10,682][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:09:11,006][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:09:11,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:09:11,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:09:11,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:09:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:09:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:09:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:09:13,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:09:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:09:13,931][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:09:14,254][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:09:14,580][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:09:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:09:15,227][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:09:15,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:09:15,875][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:09:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:09:16,523][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:09:16,850][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:09:17,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:09:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:09:17,836][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:09:18,164][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:09:18,493][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:09:18,819][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:09:19,143][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:09:19,465][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:09:19,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:09:20,534][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:09:21,275][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:09:21,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:09:21,282][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:09:22,286][__main__][INFO] - Iteration 183 took 22s (37.24% Gen, 58.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 48m 20s. Estimated total time: 18h 53m 53s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 47s, 500 more iterations: 3h 8m 58s. [2025-11-13 09:09:22,288][__main__][INFO] - Starting iteration 183. [2025-11-13 09:09:22,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:09:22,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:09:31,355][__main__][INFO] - Number of regex retries in iteration 183: 0 [2025-11-13 09:09:31,356][__main__][INFO] - agents played in iteration 183 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:09:31,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:31,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:31,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:31,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:31,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:09:31,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:09:32,700][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:09:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:09:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:09:33,649][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:09:33,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:09:34,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:09:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:09:34,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:09:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:09:35,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:09:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:09:36,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:09:36,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:09:36,895][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:09:37,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:09:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:09:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:09:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:09:38,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:09:38,843][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:09:39,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:09:39,491][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:09:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:09:40,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:09:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:09:40,788][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:09:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:09:41,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:09:41,758][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:09:42,083][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:09:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:09:42,732][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:09:43,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:09:43,793][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:09:44,543][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:09:44,544][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:09:44,546][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:09:45,573][__main__][INFO] - Iteration 184 took 23s (38.92% Gen, 56.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 6s. Estimated total time: 19h 24m 2s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 0s. [2025-11-13 09:09:45,575][__main__][INFO] - Starting iteration 184. [2025-11-13 09:09:45,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:09:45,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:09:54,095][__main__][INFO] - Number of regex retries in iteration 184: 0 [2025-11-13 09:09:54,096][__main__][INFO] - agents played in iteration 184 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:09:54,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:09:54,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:09:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:09:55,715][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:09:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:09:56,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:09:56,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:09:57,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:09:57,355][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:09:57,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:09:58,003][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:09:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:09:58,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:09:58,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:09:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:09:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:09:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:10:00,276][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:00,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:01,572][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:01,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:10:02,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:02,545][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:02,872][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:03,197][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:03,521][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:03,845][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:04,169][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:04,493][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:04,816][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:05,140][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:05,465][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:05,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:10:06,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:07,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:07,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:07,284][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:08,293][__main__][INFO] - Iteration 185 took 22s (37.49% Gen, 58.06% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 49m 28s. Estimated total time: 18h 55m 48s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 51s, 500 more iterations: 3h 9m 18s. [2025-11-13 09:10:08,295][__main__][INFO] - Starting iteration 185. [2025-11-13 09:10:08,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:10:08,299][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:10:16,862][__main__][INFO] - Number of regex retries in iteration 185: 0 [2025-11-13 09:10:16,863][__main__][INFO] - agents played in iteration 185 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:10:17,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:10:17,458][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:10:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:10:18,520][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:10:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:10:19,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:10:19,504][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:10:19,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:10:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:10:20,483][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:10:20,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:10:21,135][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:10:21,462][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:10:21,785][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:10:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:10:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:10:22,765][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:10:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:23,417][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:23,740][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:24,073][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:24,396][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:24,720][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:10:25,045][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:25,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:25,697][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:26,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:26,672][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:27,320][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:27,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:27,969][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:28,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:10:29,347][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:30,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:30,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:30,080][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:31,060][__main__][INFO] - Iteration 186 took 22s (37.62% Gen, 58.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 51m 25s. Estimated total time: 18h 58m 8s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 41s. [2025-11-13 09:10:31,063][__main__][INFO] - Starting iteration 186. [2025-11-13 09:10:31,066][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:10:31,067][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:10:39,913][__main__][INFO] - Number of regex retries in iteration 186: 0 [2025-11-13 09:10:39,913][__main__][INFO] - agents played in iteration 186 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:10:40,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,462][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:10:40,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:10:41,231][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:10:41,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:10:41,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:10:42,182][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:10:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:10:42,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:10:43,162][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:10:43,489][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:10:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:10:44,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:10:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:10:44,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:10:45,123][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:10:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:10:45,772][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:10:46,095][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:46,419][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:46,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:47,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:47,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:10:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:48,378][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:48,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:49,033][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:49,355][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:49,681][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:50,334][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:50,983][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:51,310][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:51,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:10:52,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:53,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:53,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:53,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:54,055][__main__][INFO] - Iteration 187 took 22s (38.48% Gen, 57.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 2m 24s. Estimated total time: 19h 9m 30s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 35s. [2025-11-13 09:10:54,057][__main__][INFO] - Starting iteration 187. [2025-11-13 09:10:54,062][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:10:54,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:01,923][__main__][INFO] - Number of regex retries in iteration 187: 0 [2025-11-13 09:11:01,923][__main__][INFO] - agents played in iteration 187 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:11:02,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:02,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:02,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:02,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:02,501][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:02,501][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:11:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:03,863][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:04,518][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:04,843][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:05,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:05,492][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:05,816][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:06,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:06,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:07,786][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:08,433][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:08,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:09,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:09,741][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:11:10,065][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:11:10,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:11:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:11:11,050][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:11:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:11:11,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:11:12,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:11:12,358][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:11:12,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:11:13,009][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:11:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:11:13,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:11:14,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:11:15,151][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:11:15,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:11:15,154][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:11:16,172][__main__][INFO] - Iteration 188 took 22s (35.55% Gen, 59.84% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 7s. Estimated total time: 18h 25m 34s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 51s, 500 more iterations: 3h 4m 15s. [2025-11-13 09:11:16,174][__main__][INFO] - Starting iteration 188. [2025-11-13 09:11:16,177][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:11:16,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:24,576][__main__][INFO] - Number of regex retries in iteration 188: 0 [2025-11-13 09:11:24,577][__main__][INFO] - agents played in iteration 188 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:11:25,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:25,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:25,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:25,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:25,152][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:25,153][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:11:25,890][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:26,196][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:26,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:26,849][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:27,828][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:28,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:29,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:29,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:29,781][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:30,105][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:30,759][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:31,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:31,405][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:31,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:32,378][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:11:32,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:11:33,027][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:11:33,351][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:11:33,675][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:11:34,008][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:11:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:11:34,656][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:11:34,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:11:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:11:35,628][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:11:35,951][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:11:36,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:11:37,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:11:37,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:11:37,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:11:37,773][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:11:38,716][__main__][INFO] - Iteration 189 took 22s (37.26% Gen, 58.55% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 39m 8s. Estimated total time: 18h 46m 58s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 33s, 500 more iterations: 3h 7m 49s. [2025-11-13 09:11:38,718][__main__][INFO] - Starting iteration 189. [2025-11-13 09:11:38,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:11:38,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:47,306][__main__][INFO] - Number of regex retries in iteration 189: 0 [2025-11-13 09:11:47,307][__main__][INFO] - agents played in iteration 189 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:11:47,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:47,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:47,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:47,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:47,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:47,866][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:11:48,606][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:48,901][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:49,228][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:49,881][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:50,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:50,532][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:50,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:51,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:52,157][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:52,487][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:52,811][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:53,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:53,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:53,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:54,432][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:55,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:11:55,404][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:11:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:11:56,052][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:11:56,376][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:11:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:11:57,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:11:57,347][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:11:57,672][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:11:57,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:11:58,323][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:11:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:11:58,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:11:59,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:12:00,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:12:00,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:12:00,443][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:12:01,443][__main__][INFO] - Iteration 190 took 22s (37.78% Gen, 57.82% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 47m 52s. Estimated total time: 18h 56m 4s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 52s, 500 more iterations: 3h 9m 20s. [2025-11-13 09:12:01,445][__main__][INFO] - Starting iteration 190. [2025-11-13 09:12:01,448][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. [2025-11-13 09:12:01,449][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:09,848][__main__][INFO] - Number of regex retries in iteration 190: 0 [2025-11-13 09:12:09,849][__main__][INFO] - agents played in iteration 190 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:12:10,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:10,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:10,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:10,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:10,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:10,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:12:11,173][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:12:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:12:11,797][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:12:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:12:12,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:12:12,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:12:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:12:13,430][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:12:13,754][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:12:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:12:14,402][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:12:14,726][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:12:15,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:12:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:12:15,699][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:12:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:12:16,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:12:16,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:12:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:12:17,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:12:17,647][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:12:17,971][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:12:18,295][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:12:18,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:12:18,947][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:12:19,271][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:12:19,597][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:12:19,920][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:12:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:12:20,568][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:12:20,892][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:12:21,216][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:12:21,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:12:22,257][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:12:23,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:12:23,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:12:23,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:12:24,923][__main__][INFO] - Iteration 191 took 23s (35.78% Gen, 56.09% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 11s. Estimated total time: 19h 33m 47s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 37s. [2025-11-13 09:12:24,925][__main__][INFO] - Starting iteration 191. [2025-11-13 09:12:24,929][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:12:24,929][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:33,889][__main__][INFO] - Number of regex retries in iteration 191: 0 [2025-11-13 09:12:33,890][__main__][INFO] - agents played in iteration 191 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:12:34,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:34,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:34,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:34,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:34,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:34,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:12:35,202][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:12:35,498][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:12:35,827][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:12:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:12:36,484][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:12:36,815][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:12:37,141][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:12:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:12:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:12:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:12:38,440][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:12:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:12:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:12:39,416][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:12:39,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:12:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:12:40,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:12:40,724][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:12:41,052][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:12:41,381][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:12:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:12:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:12:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:12:42,702][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:12:43,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:12:43,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:12:43,686][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:12:44,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:12:44,340][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:12:44,671][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:12:45,002][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:12:45,325][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:12:45,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:12:46,390][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:12:47,137][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:12:47,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:12:47,140][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:12:48,091][__main__][INFO] - Iteration 192 took 23s (38.68% Gen, 57.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 9m 10s. Estimated total time: 19h 18m 10s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 1s. [2025-11-13 09:12:48,093][__main__][INFO] - Starting iteration 192. [2025-11-13 09:12:48,096][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:12:48,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:56,857][__main__][INFO] - Number of regex retries in iteration 192: 0 [2025-11-13 09:12:56,857][__main__][INFO] - agents played in iteration 192 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:12:57,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:57,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:57,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:57,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:57,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:57,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:12:58,195][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:12:58,492][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:12:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:12:59,145][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:12:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:12:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:13:00,119][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:13:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:13:00,773][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:13:01,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:13:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:13:01,745][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:13:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:13:02,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:13:02,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:13:03,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:13:03,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:13:03,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:13:04,022][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:13:04,346][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:13:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:13:04,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:13:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:13:05,642][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:13:05,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:13:06,294][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:13:06,621][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:13:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:13:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:13:07,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:13:07,927][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:13:08,252][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:13:08,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:13:09,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:13:10,033][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:13:10,036][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:13:10,037][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:13:10,971][__main__][INFO] - Iteration 193 took 22s (38.30% Gen, 57.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 54m 28s. Estimated total time: 19h 3m 50s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 38s. [2025-11-13 09:13:10,974][__main__][INFO] - Starting iteration 193. [2025-11-13 09:13:10,977][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:13:10,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:13:19,745][__main__][INFO] - Number of regex retries in iteration 193: 0 [2025-11-13 09:13:19,745][__main__][INFO] - agents played in iteration 193 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:13:20,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:20,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:20,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:20,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:20,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:13:20,309][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:13:21,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:13:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:13:21,686][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:13:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:13:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:13:22,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:13:22,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:13:23,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:13:23,633][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:13:23,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:13:24,280][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:13:24,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:13:24,928][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:13:25,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:13:25,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:13:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:13:26,230][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:13:26,554][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:13:26,877][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:13:27,200][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:13:27,523][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:13:27,846][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:13:28,170][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:13:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:13:28,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:13:29,154][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:13:29,473][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:13:29,797][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:13:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:13:30,454][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:13:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:13:31,095][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:13:31,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:13:32,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:13:32,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:13:32,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:13:32,869][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:13:33,803][__main__][INFO] - Iteration 194 took 22s (38.41% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 51m 36s. Estimated total time: 19h 1m 21s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 2s, 500 more iterations: 3h 10m 13s. [2025-11-13 09:13:33,806][__main__][INFO] - Starting iteration 194. [2025-11-13 09:13:33,808][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:13:33,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:13:41,966][__main__][INFO] - Number of regex retries in iteration 194: 0 [2025-11-13 09:13:41,966][__main__][INFO] - agents played in iteration 194 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:13:42,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:42,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:42,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:42,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:42,530][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:13:42,531][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:13:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:13:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:13:43,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:13:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:13:44,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:13:44,870][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:13:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:13:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:13:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:13:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:13:46,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:13:46,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:13:47,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:13:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:13:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:13:48,109][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:13:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:13:48,762][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:13:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:13:49,421][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:13:49,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:13:50,068][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:13:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:13:50,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:13:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:13:51,368][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:13:51,691][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:13:52,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:13:52,343][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:13:52,667][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:13:52,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:13:53,321][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:13:53,645][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:13:54,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:13:55,113][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:13:55,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:13:55,116][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:13:56,075][__main__][INFO] - Iteration 195 took 22s (36.63% Gen, 59.05% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 23m 16s. Estimated total time: 18h 33m 23s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 6s, 500 more iterations: 3h 5m 33s. [2025-11-13 09:13:56,077][__main__][INFO] - Starting iteration 195. [2025-11-13 09:13:56,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:13:56,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:14:04,519][__main__][INFO] - Number of regex retries in iteration 195: 0 [2025-11-13 09:14:04,520][__main__][INFO] - agents played in iteration 195 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:14:04,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:05,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:05,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:05,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:05,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:14:05,082][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:14:05,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:14:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:14:06,414][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:14:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:14:07,063][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:14:07,388][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:14:07,712][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:14:08,036][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:14:08,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:14:08,687][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:14:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:14:09,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:14:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:14:09,985][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:14:10,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:14:10,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:14:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:14:11,282][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:14:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:14:11,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:14:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:14:12,581][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:14:12,905][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:14:13,229][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:14:13,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:14:13,879][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:14:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:14:14,531][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:14:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:14:15,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:14:15,506][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:14:15,835][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:14:16,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:14:16,895][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:14:17,590][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:14:17,591][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:14:17,593][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:14:18,496][__main__][INFO] - Iteration 196 took 22s (37.65% Gen, 58.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 30m 21s. Estimated total time: 18h 40m 51s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 21s, 500 more iterations: 3h 6m 48s. [2025-11-13 09:14:18,498][__main__][INFO] - Starting iteration 196. [2025-11-13 09:14:18,501][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:14:18,501][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:14:27,108][__main__][INFO] - Number of regex retries in iteration 196: 0 [2025-11-13 09:14:27,108][__main__][INFO] - agents played in iteration 196 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:14:27,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:27,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:27,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:27,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:27,662][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:14:27,662][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:14:28,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:14:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:14:29,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:14:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:14:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:14:30,002][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:14:30,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:14:30,657][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:14:30,983][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:14:31,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:14:31,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:14:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:14:32,278][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:14:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:14:32,926][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:14:33,251][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:14:33,576][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:14:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:14:34,224][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:14:34,547][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:14:34,873][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:14:35,198][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:14:35,523][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:14:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:14:36,175][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:14:36,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:14:36,825][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:14:37,150][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:14:37,478][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:14:37,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:14:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:14:38,463][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:14:38,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:14:39,514][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:14:40,232][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:14:40,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:14:40,235][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:14:41,165][__main__][INFO] - Iteration 197 took 22s (37.97% Gen, 57.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 42m 24s. Estimated total time: 18h 53m 16s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 52s. [2025-11-13 09:14:41,167][__main__][INFO] - Starting iteration 197. [2025-11-13 09:14:41,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:14:41,171][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:14:49,058][__main__][INFO] - Number of regex retries in iteration 197: 0 [2025-11-13 09:14:49,059][__main__][INFO] - agents played in iteration 197 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:14:49,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:49,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:49,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:49,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:49,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:14:49,634][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:14:50,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:14:50,647][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:14:50,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:14:51,299][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:14:51,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:14:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:14:52,277][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:14:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:14:52,926][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:14:53,252][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:14:53,580][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:14:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:14:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:14:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:14:54,885][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:14:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:14:55,534][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:14:55,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:14:56,182][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:14:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:14:56,831][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:14:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:14:57,482][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:14:57,807][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:14:58,133][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:14:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:14:58,787][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:14:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:14:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:14:59,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:15:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:15:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:15:00,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:15:01,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:15:02,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:15:02,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:15:02,193][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:15:03,107][__main__][INFO] - Iteration 198 took 21s (35.96% Gen, 59.87% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 5m 38s. Estimated total time: 18h 16m 52s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 33s, 500 more iterations: 3h 2m 48s. [2025-11-13 09:15:03,109][__main__][INFO] - Starting iteration 198. [2025-11-13 09:15:03,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:15:03,113][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:15:11,434][__main__][INFO] - Number of regex retries in iteration 198: 0 [2025-11-13 09:15:11,435][__main__][INFO] - agents played in iteration 198 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:15:11,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:11,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:11,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:12,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:12,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:15:12,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:15:12,724][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:15:13,020][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:15:13,358][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:15:13,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:15:14,006][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:15:14,334][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:15:14,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:15:14,983][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:15:15,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:15:15,634][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:15:15,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:15:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:15:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:15:16,929][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:15:17,263][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:15:17,587][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:15:17,911][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:15:18,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:15:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:15:18,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:15:19,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:15:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:15:19,869][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:15:20,194][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:15:20,519][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:15:20,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:15:21,170][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:15:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:15:21,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:15:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:15:22,473][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:15:22,797][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:15:23,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:15:23,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:15:24,558][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:15:24,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:15:24,562][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:15:25,537][__main__][INFO] - Iteration 199 took 22s (37.11% Gen, 58.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 40s. Estimated total time: 18h 41m 16s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 22s, 500 more iterations: 3h 6m 52s. [2025-11-13 09:15:25,540][__main__][INFO] - Starting iteration 199. [2025-11-13 09:15:25,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:15:25,544][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:15:33,818][__main__][INFO] - Number of regex retries in iteration 199: 0 [2025-11-13 09:15:33,819][__main__][INFO] - agents played in iteration 199 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:15:34,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:34,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:34,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:34,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:34,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:15:34,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:15:35,115][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:15:35,411][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:15:35,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:15:36,069][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:15:36,398][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:15:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:15:37,052][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:15:37,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:15:37,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:15:38,028][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:15:38,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:15:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:15:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:15:39,330][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:15:39,655][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:15:39,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:15:40,307][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:15:40,632][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:15:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:15:41,283][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:15:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:15:41,934][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:15:42,263][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:15:42,588][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:15:42,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:15:43,244][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:15:43,573][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:15:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:15:44,226][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:15:44,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:15:44,878][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:15:45,204][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:15:45,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:15:46,292][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:15:47,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:15:47,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:15:47,009][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:15:47,926][__main__][INFO] - Iteration 200 took 22s (36.97% Gen, 58.92% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 12s. Estimated total time: 18h 39m 11s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 18s, 500 more iterations: 3h 6m 31s. [2025-11-13 09:15:47,928][__main__][INFO] - Starting iteration 200. [2025-11-13 09:15:47,930][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. [2025-11-13 09:15:47,931][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:15:56,135][__main__][INFO] - Number of regex retries in iteration 200: 0 [2025-11-13 09:15:56,136][__main__][INFO] - agents played in iteration 200 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:15:56,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:56,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:56,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:56,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:56,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:15:56,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:15:57,415][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:15:57,710][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:15:58,040][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:15:58,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:15:58,694][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:15:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:15:59,352][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:15:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:16:00,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:16:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:16:00,654][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:16:00,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:16:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:16:01,631][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:16:01,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:16:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:16:02,604][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:16:02,930][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:16:03,255][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:16:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:16:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:16:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:16:04,565][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:16:04,889][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:16:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:16:05,538][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:16:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:16:06,187][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:16:06,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:16:06,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:16:07,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:16:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:16:07,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:16:08,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:16:09,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:16:09,254][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:16:09,256][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:16:11,088][__main__][INFO] - Iteration 201 took 23s (35.43% Gen, 56.65% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 5m 33s. Estimated total time: 19h 17m 55s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 59s. [2025-11-13 09:16:11,090][__main__][INFO] - Starting iteration 201. [2025-11-13 09:16:11,093][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:16:11,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:16:19,177][__main__][INFO] - Number of regex retries in iteration 201: 0 [2025-11-13 09:16:19,178][__main__][INFO] - agents played in iteration 201 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:16:19,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:19,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:19,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:19,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:19,752][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:16:19,752][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:16:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:16:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:16:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:16:21,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:16:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:16:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:16:22,408][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:16:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:16:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:16:23,386][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:16:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:16:24,036][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:16:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:16:24,684][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:16:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:16:25,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:16:25,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:16:25,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:16:26,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:16:26,633][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:16:26,957][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:16:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:16:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:16:27,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:16:28,257][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:16:28,582][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:16:28,907][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:16:29,232][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:16:29,558][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:16:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:16:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:16:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:16:30,858][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:16:31,601][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:16:32,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:16:32,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:16:32,318][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:16:33,269][__main__][INFO] - Iteration 202 took 22s (36.45% Gen, 59.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 6s. Estimated total time: 18h 28m 50s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 57s, 500 more iterations: 3h 4m 48s. [2025-11-13 09:16:33,271][__main__][INFO] - Starting iteration 202. [2025-11-13 09:16:33,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:16:33,276][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:16:40,826][__main__][INFO] - Number of regex retries in iteration 202: 0 [2025-11-13 09:16:40,827][__main__][INFO] - agents played in iteration 202 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:16:41,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:41,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:41,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:41,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:41,393][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:16:41,393][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:16:42,128][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:16:42,424][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:16:42,749][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:16:43,072][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:16:43,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:16:43,722][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:16:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:16:44,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:16:44,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:16:45,021][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:16:45,349][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:16:45,676][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:16:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:16:46,327][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:16:46,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:16:46,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:16:47,300][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:16:47,624][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:16:47,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:16:48,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:16:48,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:16:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:16:49,247][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:16:49,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:16:49,900][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:16:50,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:16:50,545][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:16:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:16:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:16:51,523][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:16:51,848][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:16:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:16:52,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:16:53,218][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:16:53,932][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:16:53,935][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:16:53,936][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:16:54,889][__main__][INFO] - Iteration 203 took 21s (34.94% Gen, 60.65% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 47m 37s. Estimated total time: 18h 0m 43s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 1s, 500 more iterations: 3h 0m 7s. [2025-11-13 09:16:54,891][__main__][INFO] - Starting iteration 203. [2025-11-13 09:16:54,895][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:16:54,895][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:17:02,722][__main__][INFO] - Number of regex retries in iteration 203: 0 [2025-11-13 09:17:02,723][__main__][INFO] - agents played in iteration 203 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:17:03,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:03,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:03,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:03,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:03,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:17:03,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:17:04,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:17:04,336][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:17:04,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:17:04,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:17:05,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:17:05,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:17:05,952][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:17:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:17:06,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:17:06,933][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:17:07,253][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:17:07,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:17:07,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:17:08,228][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:17:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:17:08,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:17:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:17:09,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:17:09,847][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:17:10,171][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:17:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:17:10,821][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:17:11,144][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:17:11,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:17:11,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:17:12,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:17:12,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:17:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:17:13,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:17:13,423][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:17:13,750][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:17:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:17:14,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:17:15,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:17:15,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:17:15,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:17:15,830][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:17:16,794][__main__][INFO] - Iteration 204 took 21s (35.74% Gen, 59.85% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 1m 32s. Estimated total time: 18h 15m 0s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 30s, 500 more iterations: 3h 2m 30s. [2025-11-13 09:17:16,796][__main__][INFO] - Starting iteration 204. [2025-11-13 09:17:16,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:17:16,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:17:24,416][__main__][INFO] - Number of regex retries in iteration 204: 0 [2025-11-13 09:17:24,417][__main__][INFO] - agents played in iteration 204 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:17:24,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:24,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:24,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:24,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:24,978][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:17:24,978][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:17:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:17:26,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:17:26,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:17:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:17:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:17:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:17:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:17:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:17:28,331][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:17:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:17:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:17:29,304][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:17:29,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:17:29,954][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:17:30,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:17:30,604][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:17:30,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:17:31,260][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:17:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:17:31,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:17:32,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:17:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:17:32,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:17:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:17:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:17:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:17:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:17:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:17:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:17:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:17:35,488][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:17:35,812][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:17:36,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:17:36,859][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:17:37,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:17:37,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:17:37,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:17:38,526][__main__][INFO] - Iteration 205 took 21s (35.06% Gen, 60.67% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 52m 33s. Estimated total time: 18h 6m 23s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 12s, 500 more iterations: 3h 1m 3s. [2025-11-13 09:17:38,529][__main__][INFO] - Starting iteration 205. [2025-11-13 09:17:38,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:17:38,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:17:46,173][__main__][INFO] - Number of regex retries in iteration 205: 0 [2025-11-13 09:17:46,173][__main__][INFO] - agents played in iteration 205 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:17:46,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:46,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:46,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:46,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:46,741][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:17:46,742][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:17:47,479][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:17:47,774][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:17:48,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:17:48,426][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:17:48,752][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:17:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:17:49,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:17:49,726][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:17:50,050][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:17:50,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:17:50,698][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:17:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:17:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:17:51,671][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:17:52,002][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:17:52,325][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:17:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:17:52,973][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:17:53,298][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:17:53,621][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:17:53,949][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:17:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:17:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:17:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:17:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:17:55,570][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:17:55,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:17:56,220][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:17:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:17:56,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:17:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:17:57,521][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:17:57,845][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:17:58,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:17:59,284][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:17:59,287][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:17:59,289][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:18:00,215][__main__][INFO] - Iteration 206 took 21s (35.22% Gen, 60.49% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 49m 59s. Estimated total time: 18h 4m 10s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 8s, 500 more iterations: 3h 0m 41s. [2025-11-13 09:18:00,217][__main__][INFO] - Starting iteration 206. [2025-11-13 09:18:00,220][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:18:00,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:18:08,288][__main__][INFO] - Number of regex retries in iteration 206: 0 [2025-11-13 09:18:08,289][__main__][INFO] - agents played in iteration 206 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:18:08,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:08,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:08,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:08,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:08,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:18:08,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:18:09,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:18:09,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:18:10,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:18:10,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:18:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:18:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:18:11,537][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:18:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:18:12,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:18:12,506][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:18:12,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:18:13,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:18:13,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:18:13,803][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:18:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:18:14,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:18:14,780][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:18:15,104][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:18:15,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:18:15,753][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:18:16,078][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:18:16,403][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:18:16,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:18:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:18:17,389][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:18:17,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:18:18,045][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:18:18,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:18:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:18:19,018][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:18:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:18:19,667][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:18:19,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:18:20,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:18:21,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:18:21,441][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:18:21,443][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:18:22,394][__main__][INFO] - Iteration 207 took 22s (36.38% Gen, 59.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 14m 13s. Estimated total time: 18h 28m 46s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 57s, 500 more iterations: 3h 4m 47s. [2025-11-13 09:18:22,397][__main__][INFO] - Starting iteration 207. [2025-11-13 09:18:22,400][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:18:22,401][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:18:30,517][__main__][INFO] - Number of regex retries in iteration 207: 0 [2025-11-13 09:18:30,517][__main__][INFO] - agents played in iteration 207 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:18:30,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:31,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:31,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:31,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:31,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:18:31,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:18:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:18:32,137][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:18:32,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:18:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:18:33,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:18:33,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:18:33,760][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:18:34,087][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:18:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:18:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:18:35,073][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:18:35,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:18:35,729][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:18:36,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:18:36,383][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:18:36,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:18:37,032][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:18:37,359][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:18:37,685][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:18:38,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:18:38,342][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:18:38,670][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:18:38,994][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:18:39,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:18:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:18:39,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:18:40,295][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:18:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:18:40,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:18:41,269][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:18:41,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:18:41,918][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:18:42,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:18:42,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:18:43,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:18:43,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:18:43,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:18:44,672][__main__][INFO] - Iteration 208 took 22s (36.44% Gen, 59.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 18m 43s. Estimated total time: 18h 33m 39s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 7s, 500 more iterations: 3h 5m 36s. [2025-11-13 09:18:44,674][__main__][INFO] - Starting iteration 208. [2025-11-13 09:18:44,678][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:18:44,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:18:52,713][__main__][INFO] - Number of regex retries in iteration 208: 0 [2025-11-13 09:18:52,714][__main__][INFO] - agents played in iteration 208 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:18:53,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:53,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:53,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:53,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:53,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:18:53,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:18:54,044][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:18:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:18:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:18:54,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:18:55,314][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:18:55,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:18:55,963][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:18:56,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:18:56,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:18:56,937][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:18:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:18:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:18:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:18:58,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:18:58,567][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:18:58,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:18:59,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:18:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:18:59,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:19:00,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:19:00,512][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:19:00,837][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:19:01,162][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:19:01,488][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:19:01,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:19:02,142][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:19:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:19:02,791][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:19:03,115][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:19:03,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:19:03,764][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:19:04,089][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:19:04,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:19:05,148][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:19:05,856][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:19:05,858][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:19:05,860][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:19:06,899][__main__][INFO] - Iteration 209 took 22s (36.16% Gen, 59.16% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 48s. Estimated total time: 18h 31m 6s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 2s, 500 more iterations: 3h 5m 11s. [2025-11-13 09:19:06,901][__main__][INFO] - Starting iteration 209. [2025-11-13 09:19:06,904][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:19:06,905][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:19:14,993][__main__][INFO] - Number of regex retries in iteration 209: 0 [2025-11-13 09:19:14,994][__main__][INFO] - agents played in iteration 209 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:19:15,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:15,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:15,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:15,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:15,572][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:19:15,572][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:19:16,326][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:19:16,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:19:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:19:17,271][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:19:17,596][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:19:17,920][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:19:18,246][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:19:18,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:19:18,899][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:19:19,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:19:19,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:19:19,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:19:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:19:20,526][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:19:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:19:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:19:21,511][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:19:21,838][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:19:22,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:19:22,488][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:19:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:19:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:19:23,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:19:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:19:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:19:24,442][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:19:24,766][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:19:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:19:25,418][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:19:25,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:19:26,069][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:19:26,393][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:19:26,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:19:27,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:19:28,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:19:28,195][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:19:28,197][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:19:29,130][__main__][INFO] - Iteration 210 took 22s (36.39% Gen, 59.40% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 40s. Estimated total time: 18h 31m 21s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 2s, 500 more iterations: 3h 5m 13s. [2025-11-13 09:19:29,132][__main__][INFO] - Starting iteration 210. [2025-11-13 09:19:29,136][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. [2025-11-13 09:19:29,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:19:37,110][__main__][INFO] - Number of regex retries in iteration 210: 0 [2025-11-13 09:19:37,111][__main__][INFO] - agents played in iteration 210 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:19:37,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:37,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:37,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:37,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:37,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:19:37,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:19:38,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:19:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:19:39,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:19:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:19:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:19:40,072][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:19:40,397][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:19:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:19:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:19:41,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:19:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:19:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:19:42,358][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:19:42,684][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:19:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:19:43,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:19:43,668][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:19:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:19:44,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:19:44,645][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:19:44,969][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:19:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:19:45,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:19:45,947][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:19:46,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:19:46,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:19:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:19:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:19:47,577][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:19:47,901][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:19:48,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:19:48,553][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:19:48,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:19:49,602][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:19:50,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:19:50,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:19:50,315][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:19:52,218][__main__][INFO] - Iteration 211 took 23s (34.55% Gen, 57.20% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 58m 8s. Estimated total time: 19h 14m 11s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 21s. [2025-11-13 09:19:52,221][__main__][INFO] - Starting iteration 211. [2025-11-13 09:19:52,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:19:52,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:20:00,695][__main__][INFO] - Number of regex retries in iteration 211: 0 [2025-11-13 09:20:00,695][__main__][INFO] - agents played in iteration 211 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:20:01,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:01,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:01,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:01,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:01,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:20:01,275][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:20:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:20:02,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:20:02,655][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:20:02,984][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:20:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:20:03,634][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:20:03,962][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:20:04,287][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:20:04,612][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:20:04,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:20:05,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:20:05,586][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:20:05,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:20:06,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:20:06,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:20:06,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:20:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:20:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:20:07,871][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:20:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:20:08,531][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:20:08,855][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:20:09,181][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:20:09,507][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:20:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:20:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:20:10,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:20:10,809][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:20:11,134][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:20:11,459][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:20:11,784][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:20:12,109][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:20:12,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:20:13,163][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:20:13,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:20:13,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:20:13,915][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:20:14,888][__main__][INFO] - Iteration 212 took 22s (37.37% Gen, 58.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 36m 50s. Estimated total time: 18h 53m 16s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 52s. [2025-11-13 09:20:14,890][__main__][INFO] - Starting iteration 212. [2025-11-13 09:20:14,894][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:20:14,895][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:20:22,672][__main__][INFO] - Number of regex retries in iteration 212: 0 [2025-11-13 09:20:22,673][__main__][INFO] - agents played in iteration 212 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:20:23,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:23,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:23,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:23,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:23,249][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:20:23,249][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:20:24,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:20:24,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:20:24,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:20:24,979][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:20:25,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:20:25,625][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:20:25,950][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:20:26,275][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:20:26,598][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:20:26,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:20:27,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:20:27,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:20:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:20:28,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:20:28,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:20:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:20:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:20:29,530][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:20:29,855][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:20:30,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:20:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:20:30,841][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:20:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:20:31,495][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:20:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:20:32,146][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:20:32,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:20:32,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:20:33,131][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:20:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:20:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:20:34,109][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:20:34,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:20:35,160][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:20:35,902][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:20:35,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:20:35,905][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:20:36,924][__main__][INFO] - Iteration 213 took 22s (35.30% Gen, 60.06% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 4m 46s. Estimated total time: 18h 21m 34s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 43s, 500 more iterations: 3h 3m 35s. [2025-11-13 09:20:36,927][__main__][INFO] - Starting iteration 213. [2025-11-13 09:20:36,930][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:20:36,931][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:20:44,720][__main__][INFO] - Number of regex retries in iteration 213: 0 [2025-11-13 09:20:44,720][__main__][INFO] - agents played in iteration 213 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:20:45,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:45,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:45,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:45,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:45,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:20:45,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:20:46,071][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:20:46,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:20:46,696][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:20:47,021][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:20:47,350][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:20:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:20:47,993][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:20:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:20:48,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:20:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:20:49,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:20:49,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:20:49,946][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:20:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:20:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:20:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:20:51,238][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:20:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:20:51,891][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:20:52,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:20:52,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:20:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:20:53,192][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:20:53,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:20:53,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:20:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:20:54,493][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:20:54,818][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:20:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:20:55,470][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:20:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:20:56,120][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:20:56,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:20:57,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:20:57,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:20:57,927][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:20:57,928][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:20:58,922][__main__][INFO] - Iteration 214 took 21s (35.42% Gen, 60.06% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 2m 26s. Estimated total time: 18h 19m 36s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 39s, 500 more iterations: 3h 3m 16s. [2025-11-13 09:20:58,924][__main__][INFO] - Starting iteration 214. [2025-11-13 09:20:58,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:20:58,928][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:21:06,574][__main__][INFO] - Number of regex retries in iteration 214: 0 [2025-11-13 09:21:06,575][__main__][INFO] - agents played in iteration 214 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:21:07,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:07,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:07,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:07,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:07,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:21:07,154][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:21:07,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:21:08,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:21:08,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:21:08,865][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:21:09,194][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:21:09,524][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:21:09,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:21:10,173][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:21:10,498][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:21:10,823][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:21:11,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:21:11,472][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:21:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:21:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:21:12,447][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:21:12,772][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:21:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:21:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:21:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:21:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:21:14,399][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:21:14,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:21:15,054][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:21:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:21:15,701][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:21:16,029][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:21:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:21:16,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:21:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:21:17,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:21:17,664][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:21:17,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:21:18,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:21:19,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:21:19,795][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:21:19,797][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:21:19,799][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:21:20,836][__main__][INFO] - Iteration 215 took 21s (34.90% Gen, 60.36% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 57m 56s. Estimated total time: 18h 15m 28s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 30s, 500 more iterations: 3h 2m 34s. [2025-11-13 09:21:20,838][__main__][INFO] - Starting iteration 215. [2025-11-13 09:21:20,841][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:21:20,842][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:21:28,560][__main__][INFO] - Number of regex retries in iteration 215: 0 [2025-11-13 09:21:28,561][__main__][INFO] - agents played in iteration 215 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:21:29,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:29,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:29,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:29,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:29,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:21:29,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:21:29,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:21:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:21:30,504][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:21:30,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:21:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:21:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:21:31,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:21:32,141][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:21:32,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:21:32,791][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:21:33,116][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:21:33,441][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:21:33,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:21:34,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:21:34,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:21:34,741][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:21:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:21:35,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:21:35,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:21:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:21:36,371][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:21:36,701][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:21:37,026][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:21:37,356][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:21:37,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:21:38,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:21:38,330][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:21:38,655][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:21:38,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:21:39,307][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:21:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:21:39,959][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:21:40,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:21:41,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:21:41,769][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:21:41,770][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:21:41,772][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:21:42,762][__main__][INFO] - Iteration 216 took 21s (35.21% Gen, 60.26% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 58m 12s. Estimated total time: 18h 16m 6s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 32s, 500 more iterations: 3h 2m 41s. [2025-11-13 09:21:42,765][__main__][INFO] - Starting iteration 216. [2025-11-13 09:21:42,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:21:42,768][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:21:47,839][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1 [2025-11-13 09:21:50,635][__main__][INFO] - Number of regex retries in iteration 216: 1 [2025-11-13 09:21:50,636][__main__][INFO] - agents played in iteration 216 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:21:51,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:51,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:51,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:51,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:21:51,208][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:21:51,209][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:21:51,970][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:21:52,270][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:21:52,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:21:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:21:53,242][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:21:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:21:53,899][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:21:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:21:54,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:21:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:21:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:21:55,523][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:21:55,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:21:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:21:56,498][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:21:56,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:21:57,147][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:21:57,479][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:21:57,799][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:21:58,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:21:58,453][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:21:58,784][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:21:59,103][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:21:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:21:59,753][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:22:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:22:00,403][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:22:00,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:22:01,053][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:22:01,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:22:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:22:02,028][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:22:02,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:22:03,125][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:22:03,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:22:03,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:22:03,838][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:22:04,817][__main__][INFO] - Iteration 217 took 22s (35.68% Gen, 59.87% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 4m 14s. Estimated total time: 18h 22m 30s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 45s, 500 more iterations: 3h 3m 45s. [2025-11-13 09:22:04,819][__main__][INFO] - Starting iteration 217. [2025-11-13 09:22:04,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:22:04,823][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:22:12,942][__main__][INFO] - Number of regex retries in iteration 217: 0 [2025-11-13 09:22:12,943][__main__][INFO] - agents played in iteration 217 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:22:13,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:13,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:13,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:13,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:13,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:22:13,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:22:14,232][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:22:14,529][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:22:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:22:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:22:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:22:15,830][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:22:16,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:22:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:22:16,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:22:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:22:17,473][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:22:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:22:18,122][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:22:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:22:18,775][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:22:19,100][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:22:19,425][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:22:19,750][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:22:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:22:20,406][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:22:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:22:21,059][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:22:21,386][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:22:21,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:22:22,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:22:22,383][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:22:22,703][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:22:23,028][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:22:23,354][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:22:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:22:24,004][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:22:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:22:24,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:22:25,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:22:26,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:22:26,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:22:26,121][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:22:27,111][__main__][INFO] - Iteration 218 took 22s (36.43% Gen, 59.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 52s. Estimated total time: 18h 34m 30s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 9s, 500 more iterations: 3h 5m 45s. [2025-11-13 09:22:27,113][__main__][INFO] - Starting iteration 218. [2025-11-13 09:22:27,116][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:22:27,117][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:22:34,912][__main__][INFO] - Number of regex retries in iteration 218: 0 [2025-11-13 09:22:34,913][__main__][INFO] - agents played in iteration 218 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:22:35,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:35,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:35,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:35,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:35,462][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:22:35,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:22:36,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:22:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:22:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:22:37,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:22:37,443][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:22:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:22:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:22:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:22:38,750][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:22:39,080][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:22:39,406][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:22:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:22:40,054][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:22:40,378][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:22:40,703][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:22:41,032][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:22:41,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:22:41,682][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:22:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:22:42,333][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:22:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:22:42,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:22:43,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:22:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:22:43,962][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:22:44,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:22:44,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:22:44,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:22:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:22:45,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:22:45,917][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:22:46,242][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:22:46,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:22:47,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:22:48,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:22:48,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:22:48,041][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:22:49,038][__main__][INFO] - Iteration 219 took 21s (35.56% Gen, 59.89% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 57m 7s. Estimated total time: 18h 16m 8s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 32s, 500 more iterations: 3h 2m 41s. [2025-11-13 09:22:49,040][__main__][INFO] - Starting iteration 219. [2025-11-13 09:22:49,044][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:22:49,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:22:57,179][__main__][INFO] - Number of regex retries in iteration 219: 0 [2025-11-13 09:22:57,180][__main__][INFO] - agents played in iteration 219 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:22:57,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:57,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:57,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:57,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:57,728][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:22:57,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:22:58,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:22:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:22:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:22:59,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:22:59,710][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:23:00,035][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:23:00,361][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:23:00,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:23:01,019][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:23:01,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:23:01,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:23:02,004][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:23:02,332][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:23:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:23:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:23:03,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:23:03,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:23:03,956][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:23:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:23:04,607][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:23:04,932][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:23:05,257][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:23:05,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:23:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:23:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:23:06,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:23:06,891][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:23:07,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:23:07,539][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:23:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:23:08,194][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:23:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:23:08,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:23:09,587][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:23:10,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:23:10,317][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:23:10,318][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:23:11,244][__main__][INFO] - Iteration 220 took 22s (36.64% Gen, 59.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 10m 40s. Estimated total time: 18h 30m 3s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 0s, 500 more iterations: 3h 5m 0s. [2025-11-13 09:23:11,246][__main__][INFO] - Starting iteration 220. [2025-11-13 09:23:11,249][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. [2025-11-13 09:23:11,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:23:18,749][__main__][INFO] - Number of regex retries in iteration 220: 0 [2025-11-13 09:23:18,749][__main__][INFO] - agents played in iteration 220 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:23:19,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:19,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:19,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:19,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:19,293][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:23:19,294][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:23:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:23:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:23:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:23:20,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:23:21,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:23:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:23:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:23:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:23:22,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:23:22,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:23:23,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:23:23,544][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:23:23,871][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:23:24,197][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:23:24,523][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:23:24,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:23:25,174][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:23:25,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:23:25,826][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:23:26,151][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:23:26,476][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:23:26,803][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:23:27,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:23:27,452][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:23:27,778][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:23:28,106][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:23:28,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:23:28,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:23:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:23:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:23:29,729][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:23:30,054][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:23:30,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:23:31,160][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:23:31,856][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:23:31,858][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:23:31,860][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:23:33,661][__main__][INFO] - Iteration 221 took 22s (33.46% Gen, 58.50% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 52s. Estimated total time: 18h 40m 37s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 21s, 500 more iterations: 3h 6m 46s. [2025-11-13 09:23:33,663][__main__][INFO] - Starting iteration 221. [2025-11-13 09:23:33,666][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:23:33,667][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:23:42,401][__main__][INFO] - Number of regex retries in iteration 221: 0 [2025-11-13 09:23:42,402][__main__][INFO] - agents played in iteration 221 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:23:42,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:42,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:42,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:42,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:42,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:23:42,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:23:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:23:43,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:23:44,272][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:23:44,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:23:44,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:23:45,250][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:23:45,577][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:23:45,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:23:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:23:46,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:23:46,890][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:23:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:23:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:23:47,862][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:23:48,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:23:48,517][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:23:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:23:49,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:23:49,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:23:49,815][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:23:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:23:50,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:23:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:23:51,121][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:23:51,446][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:23:51,770][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:23:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:23:52,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:23:52,747][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:23:53,072][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:23:53,397][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:23:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:23:54,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:23:54,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:23:55,504][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:23:55,507][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:23:55,509][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:23:56,428][__main__][INFO] - Iteration 222 took 22s (38.37% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 38m 3s. Estimated total time: 18h 58m 11s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 41s. [2025-11-13 09:23:56,431][__main__][INFO] - Starting iteration 222. [2025-11-13 09:23:56,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:23:56,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:24:04,613][__main__][INFO] - Number of regex retries in iteration 222: 0 [2025-11-13 09:24:04,614][__main__][INFO] - agents played in iteration 222 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:24:05,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:05,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:05,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:05,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:05,169][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:24:05,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:24:05,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:24:06,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:24:06,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:24:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:24:07,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:24:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:24:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:24:08,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:24:08,419][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:24:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:24:09,068][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:24:09,393][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:24:09,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:24:10,044][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:24:10,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:24:10,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:24:11,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:24:11,348][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:24:11,673][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:24:11,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:24:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:24:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:24:12,972][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:24:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:24:13,623][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:24:13,947][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:24:14,272][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:24:14,598][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:24:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:24:15,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:24:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:24:15,896][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:24:16,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:24:16,982][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:24:17,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:24:17,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:24:17,704][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:24:18,623][__main__][INFO] - Iteration 223 took 22s (36.86% Gen, 58.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 8m 59s. Estimated total time: 18h 29m 29s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 58s, 500 more iterations: 3h 4m 54s. [2025-11-13 09:24:18,625][__main__][INFO] - Starting iteration 223. [2025-11-13 09:24:18,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:24:18,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:24:26,869][__main__][INFO] - Number of regex retries in iteration 223: 0 [2025-11-13 09:24:26,869][__main__][INFO] - agents played in iteration 223 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:24:27,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:24:27,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:24:28,122][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:24:28,418][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:24:28,744][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:24:29,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:24:29,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:24:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:24:30,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:24:30,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:24:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:24:31,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:24:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:24:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:24:31,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:24:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:24:32,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:24:32,982][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:24:33,309][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:24:33,635][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:24:33,960][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:24:34,285][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:24:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:24:34,937][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:24:35,262][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:24:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:24:35,912][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:24:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:24:36,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:24:36,887][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:24:37,213][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:24:37,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:24:37,864][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:24:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:24:38,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:24:39,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:24:39,956][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:24:39,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:24:39,960][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:24:40,885][__main__][INFO] - Iteration 224 took 22s (37.03% Gen, 58.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 12m 3s. Estimated total time: 18h 32m 55s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 5s, 500 more iterations: 3h 5m 29s. [2025-11-13 09:24:40,887][__main__][INFO] - Starting iteration 224. [2025-11-13 09:24:40,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:24:40,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:24:49,192][__main__][INFO] - Number of regex retries in iteration 224: 0 [2025-11-13 09:24:49,193][__main__][INFO] - agents played in iteration 224 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:24:49,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:49,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:49,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:49,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:49,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:24:49,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:24:50,425][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:24:50,720][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:24:51,045][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:24:51,374][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:24:51,698][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:24:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:24:52,347][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:24:52,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:24:52,994][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:24:53,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:24:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:24:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:24:54,296][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:24:54,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:24:54,946][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:24:55,271][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:24:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:24:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:24:56,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:24:56,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:24:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:24:57,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:24:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:24:57,882][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:24:58,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:24:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:24:58,864][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:24:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:24:59,514][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:24:59,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:25:00,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:25:00,489][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:25:00,815][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:25:01,570][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:25:02,277][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:25:02,279][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:25:02,281][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:25:03,193][__main__][INFO] - Iteration 225 took 22s (37.22% Gen, 58.68% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 13m 58s. Estimated total time: 18h 35m 12s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 10s, 500 more iterations: 3h 5m 52s. [2025-11-13 09:25:03,195][__main__][INFO] - Starting iteration 225. [2025-11-13 09:25:03,198][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:25:03,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:11,501][__main__][INFO] - Number of regex retries in iteration 225: 0 [2025-11-13 09:25:11,501][__main__][INFO] - agents played in iteration 225 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:25:11,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:11,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:12,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:12,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:12,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:12,048][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:25:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:25:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:25:13,360][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:25:13,684][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:25:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:25:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:25:14,656][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:25:14,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:25:15,304][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:25:15,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:25:15,952][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:25:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:25:16,611][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:25:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:25:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:25:17,583][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:25:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:25:18,238][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:25:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:25:18,888][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:25:19,213][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:25:19,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:25:19,867][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:25:20,193][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:25:20,519][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:25:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:25:21,170][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:25:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:25:21,820][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:25:22,145][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:25:22,470][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:25:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:25:23,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:25:23,883][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:25:24,585][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:25:24,587][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:25:24,589][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:25:25,522][__main__][INFO] - Iteration 226 took 22s (37.19% Gen, 58.63% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 14m 36s. Estimated total time: 18h 36m 13s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 12s, 500 more iterations: 3h 6m 2s. [2025-11-13 09:25:25,525][__main__][INFO] - Starting iteration 226. [2025-11-13 09:25:25,528][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:25:25,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:34,440][__main__][INFO] - Number of regex retries in iteration 226: 0 [2025-11-13 09:25:34,441][__main__][INFO] - agents played in iteration 226 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:25:34,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:34,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:34,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:34,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:34,983][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:34,984][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:25:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:25:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:25:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:25:36,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:25:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:25:37,285][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:25:37,610][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:25:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:25:38,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:25:38,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:25:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:25:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:25:39,553][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:25:39,877][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:25:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:25:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:25:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:25:41,175][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:25:41,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:25:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:25:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:25:42,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:25:42,806][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:25:43,136][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:25:43,466][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:25:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:25:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:25:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:25:44,775][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:25:45,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:25:45,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:25:45,750][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:25:46,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:25:46,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:25:47,560][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:25:47,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:25:47,564][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:25:48,495][__main__][INFO] - Iteration 227 took 22s (38.80% Gen, 57.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 46m 25s. Estimated total time: 19h 8m 24s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 24s. [2025-11-13 09:25:48,497][__main__][INFO] - Starting iteration 227. [2025-11-13 09:25:48,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:25:48,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:57,592][__main__][INFO] - Number of regex retries in iteration 227: 0 [2025-11-13 09:25:57,592][__main__][INFO] - agents played in iteration 227 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:25:58,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:58,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:58,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:58,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:58,148][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:58,148][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:25:58,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:25:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:25:59,465][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:25:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:26:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:26:00,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:26:00,762][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:26:01,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:26:01,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:26:01,733][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:26:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:26:02,381][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:26:02,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:26:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:26:03,358][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:26:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:26:04,006][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:26:04,330][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:26:04,654][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:26:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:26:05,303][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:26:05,628][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:26:05,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:26:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:26:06,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:26:06,931][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:26:07,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:26:07,584][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:26:07,909][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:26:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:26:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:26:08,885][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:26:09,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:26:09,991][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:26:10,685][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:26:10,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:26:10,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:26:11,654][__main__][INFO] - Iteration 228 took 23s (39.26% Gen, 56.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 22s. Estimated total time: 19h 17m 45s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 57s. [2025-11-13 09:26:11,658][__main__][INFO] - Starting iteration 228. [2025-11-13 09:26:11,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:26:11,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:26:20,061][__main__][INFO] - Number of regex retries in iteration 228: 0 [2025-11-13 09:26:20,062][__main__][INFO] - agents played in iteration 228 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:26:20,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:20,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:20,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:20,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:20,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:26:20,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:26:21,311][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:26:21,607][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:26:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:26:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:26:22,583][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:26:22,907][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:26:23,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:26:23,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:26:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:26:24,201][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:26:24,524][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:26:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:26:25,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:26:25,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:26:25,822][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:26:26,147][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:26:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:26:26,795][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:26:27,117][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:26:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:26:27,765][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:26:28,090][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:26:28,414][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:26:28,739][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:26:29,064][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:26:29,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:26:29,717][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:26:30,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:26:30,375][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:26:30,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:26:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:26:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:26:31,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:26:32,444][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:26:33,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:26:33,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:26:33,139][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:26:34,036][__main__][INFO] - Iteration 229 took 22s (37.54% Gen, 58.44% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 2s. Estimated total time: 18h 38m 47s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 17s, 500 more iterations: 3h 6m 27s. [2025-11-13 09:26:34,038][__main__][INFO] - Starting iteration 229. [2025-11-13 09:26:34,041][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:26:34,041][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:26:42,628][__main__][INFO] - Number of regex retries in iteration 229: 0 [2025-11-13 09:26:42,628][__main__][INFO] - agents played in iteration 229 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:26:43,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,176][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:26:43,177][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:26:43,874][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:26:44,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:26:44,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:26:44,817][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:26:45,146][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:26:45,464][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:26:45,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:26:46,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:26:46,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:26:46,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:26:47,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:26:47,408][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:26:47,737][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:26:48,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:26:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:26:48,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:26:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:26:49,355][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:26:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:26:50,003][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:26:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:26:50,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:26:50,977][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:26:51,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:26:51,633][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:26:51,954][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:26:52,278][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:26:52,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:26:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:26:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:26:53,592][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:26:53,922][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:26:54,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:26:55,014][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:26:55,728][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:26:55,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:26:55,730][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:26:56,653][__main__][INFO] - Iteration 230 took 22s (37.97% Gen, 57.94% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 32s. Estimated total time: 18h 50m 39s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 41s, 500 more iterations: 3h 8m 26s. [2025-11-13 09:26:56,655][__main__][INFO] - Starting iteration 230. [2025-11-13 09:26:56,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 09:26:56,658][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:27:05,752][__main__][INFO] - Number of regex retries in iteration 230: 0 [2025-11-13 09:27:05,752][__main__][INFO] - agents played in iteration 230 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:27:06,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:06,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:06,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:06,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:06,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:27:06,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:27:07,074][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:27:07,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:27:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:27:08,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:27:08,347][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:27:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:27:08,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:27:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:27:09,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:27:09,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:27:10,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:27:10,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:27:10,946][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:27:11,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:27:11,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:27:11,920][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:27:12,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:27:12,579][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:27:12,902][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:27:13,227][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:27:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:27:13,876][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:27:14,200][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:27:14,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:27:14,850][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:27:15,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:27:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:27:15,841][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:27:16,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:27:16,500][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:27:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:27:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:27:17,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:27:18,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:27:18,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:27:18,959][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:27:18,961][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:27:20,724][__main__][INFO] - Iteration 231 took 24s (37.78% Gen, 54.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 39m 48s. Estimated total time: 20h 3m 20s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 6s, 500 more iterations: 3h 20m 33s. [2025-11-13 09:27:20,726][__main__][INFO] - Starting iteration 231. [2025-11-13 09:27:20,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:27:20,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:27:30,437][__main__][INFO] - Number of regex retries in iteration 231: 0 [2025-11-13 09:27:30,437][__main__][INFO] - agents played in iteration 231 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:27:30,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:30,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:30,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:31,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:31,001][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:27:31,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:27:31,698][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:27:31,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:27:32,318][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:27:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:27:32,969][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:27:33,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:27:33,621][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:27:33,948][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:27:34,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:27:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:27:34,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:27:35,255][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:27:35,581][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:27:35,904][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:27:36,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:27:36,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:27:36,879][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:27:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:27:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:27:37,860][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:27:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:27:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:27:38,839][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:27:39,165][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:27:39,491][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:27:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:27:40,142][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:27:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:27:40,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:27:41,121][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:27:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:27:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:27:42,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:27:42,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:27:43,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:27:43,592][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:27:43,594][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:27:44,510][__main__][INFO] - Iteration 232 took 23s (40.82% Gen, 55.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 11s. Estimated total time: 19h 49m 7s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 11s. [2025-11-13 09:27:44,512][__main__][INFO] - Starting iteration 232. [2025-11-13 09:27:44,515][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:27:44,516][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:27:54,026][__main__][INFO] - Number of regex retries in iteration 232: 0 [2025-11-13 09:27:54,027][__main__][INFO] - agents played in iteration 232 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:27:54,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:54,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:54,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:54,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:54,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:27:54,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:27:55,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:27:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:27:55,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:27:56,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:27:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:27:56,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:27:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:27:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:27:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:27:58,149][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:27:58,473][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:27:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:27:59,123][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:27:59,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:27:59,772][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:28:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:28:00,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:28:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:28:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:28:01,392][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:28:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:28:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:28:02,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:28:02,697][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:28:03,022][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:28:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:28:03,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:28:03,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:28:04,324][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:28:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:28:04,974][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:28:05,303][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:28:05,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:28:06,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:28:07,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:28:07,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:28:07,109][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:28:08,016][__main__][INFO] - Iteration 233 took 23s (40.47% Gen, 55.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 10m 46s. Estimated total time: 19h 35m 5s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 50s. [2025-11-13 09:28:08,018][__main__][INFO] - Starting iteration 233. [2025-11-13 09:28:08,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:28:08,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:28:17,202][__main__][INFO] - Number of regex retries in iteration 233: 0 [2025-11-13 09:28:17,203][__main__][INFO] - agents played in iteration 233 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:28:17,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:17,743][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:28:17,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:28:18,452][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:28:18,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:28:19,069][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:28:19,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:28:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:28:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:28:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:28:20,688][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:28:21,011][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:28:21,335][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:28:21,659][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:28:21,983][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:28:22,308][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:28:22,631][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:28:22,955][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:28:23,279][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:28:23,603][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:28:23,927][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:28:24,252][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:28:24,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:28:24,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:28:25,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:28:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:28:25,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:28:26,210][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:28:26,538][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:28:26,862][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:28:27,187][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:28:27,513][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:28:27,843][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:28:28,173][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:28:28,499][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:28:28,823][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:28:29,574][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:28:30,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:28:30,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:28:30,316][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:28:31,210][__main__][INFO] - Iteration 234 took 23s (39.59% Gen, 56.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 47s. Estimated total time: 19h 19m 30s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 15s. [2025-11-13 09:28:31,213][__main__][INFO] - Starting iteration 234. [2025-11-13 09:28:31,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:28:31,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:28:40,771][__main__][INFO] - Number of regex retries in iteration 234: 0 [2025-11-13 09:28:40,771][__main__][INFO] - agents played in iteration 234 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:28:41,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:41,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:41,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:41,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:41,323][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:28:41,323][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:28:42,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:28:42,311][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:28:42,635][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:28:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:28:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:28:43,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:28:43,929][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:28:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:28:44,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:28:44,902][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:28:45,225][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:28:45,549][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:28:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:28:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:28:46,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:28:46,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:28:47,171][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:28:47,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:28:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:28:48,146][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:28:48,469][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:28:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:28:49,117][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:28:49,446][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:28:49,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:28:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:28:50,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:28:50,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:28:51,070][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:28:51,395][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:28:51,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:28:52,045][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:28:52,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:28:53,155][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:28:53,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:28:53,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:28:53,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:28:54,785][__main__][INFO] - Iteration 235 took 23s (40.54% Gen, 55.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 25s. Estimated total time: 19h 38m 31s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 25s. [2025-11-13 09:28:54,787][__main__][INFO] - Starting iteration 235. [2025-11-13 09:28:54,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:28:54,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:29:03,530][__main__][INFO] - Number of regex retries in iteration 235: 0 [2025-11-13 09:29:03,531][__main__][INFO] - agents played in iteration 235 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:29:04,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:04,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:04,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:04,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:04,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:29:04,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:29:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:29:05,115][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:29:05,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:29:05,770][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:29:06,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:29:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:29:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:29:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:29:07,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:29:07,714][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:29:08,040][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:29:08,368][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:29:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:29:09,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:29:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:29:09,669][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:29:09,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:29:10,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:29:10,643][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:29:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:29:11,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:29:11,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:29:11,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:29:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:29:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:29:12,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:29:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:29:13,581][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:29:13,905][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:29:14,231][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:29:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:29:14,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:29:15,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:29:15,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:29:16,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:29:16,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:29:16,712][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:29:17,710][__main__][INFO] - Iteration 236 took 22s (38.13% Gen, 57.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 40m 35s. Estimated total time: 19h 6m 4s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 0s. [2025-11-13 09:29:17,713][__main__][INFO] - Starting iteration 236. [2025-11-13 09:29:17,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:29:17,717][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:29:27,317][__main__][INFO] - Number of regex retries in iteration 236: 0 [2025-11-13 09:29:27,318][__main__][INFO] - agents played in iteration 236 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:29:27,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:27,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:27,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:27,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:27,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:29:27,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:29:28,651][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:29:28,946][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:29:29,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:29:29,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:29:29,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:29:30,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:29:30,566][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:29:30,891][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:29:31,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:29:31,537][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:29:31,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:29:32,186][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:29:32,510][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:29:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:29:33,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:29:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:29:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:29:34,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:29:34,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:29:34,784][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:29:35,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:29:35,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:29:35,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:29:36,078][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:29:36,404][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:29:36,729][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:29:37,053][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:29:37,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:29:37,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:29:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:29:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:29:38,686][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:29:39,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:29:39,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:29:40,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:29:40,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:29:40,553][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:29:41,529][__main__][INFO] - Iteration 237 took 23s (40.31% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 24m 48s. Estimated total time: 19h 50m 40s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 26s. [2025-11-13 09:29:41,531][__main__][INFO] - Starting iteration 237. [2025-11-13 09:29:41,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:29:41,535][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:29:50,860][__main__][INFO] - Number of regex retries in iteration 237: 0 [2025-11-13 09:29:50,861][__main__][INFO] - agents played in iteration 237 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:29:51,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:51,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:51,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:51,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:51,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:29:51,433][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:29:52,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:29:52,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:29:52,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:29:53,134][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:29:53,458][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:29:53,782][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:29:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:29:54,430][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:29:54,756][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:29:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:29:55,407][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:29:55,733][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:29:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:29:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:29:56,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:29:57,043][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:29:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:29:57,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:29:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:29:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:29:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:29:58,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:29:59,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:29:59,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:29:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:30:00,294][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:30:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:30:00,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:30:01,271][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:30:01,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:30:01,920][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:30:02,244][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:30:02,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:30:03,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:30:04,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:30:04,187][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:30:04,188][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:30:05,177][__main__][INFO] - Iteration 238 took 23s (39.44% Gen, 56.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 56s. Estimated total time: 19h 42m 12s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s. [2025-11-13 09:30:05,180][__main__][INFO] - Starting iteration 238. [2025-11-13 09:30:05,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:30:05,184][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:30:14,416][__main__][INFO] - Number of regex retries in iteration 238: 0 [2025-11-13 09:30:14,417][__main__][INFO] - agents played in iteration 238 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:30:14,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:14,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:14,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:15,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:15,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:30:15,008][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:30:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:30:16,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:30:16,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:30:16,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:30:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:30:17,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:30:17,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:30:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:30:18,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:30:18,652][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:30:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:30:19,301][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:30:19,625][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:30:19,950][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:30:20,282][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:30:20,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:30:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:30:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:30:21,580][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:30:21,899][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:30:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:30:22,547][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:30:22,879][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:30:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:30:23,520][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:30:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:30:24,174][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:30:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:30:24,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:30:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:30:25,473][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:30:25,793][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:30:26,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:30:26,845][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:30:27,621][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:30:27,623][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:30:27,624][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:30:28,617][__main__][INFO] - Iteration 239 took 23s (39.40% Gen, 56.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 3s. Estimated total time: 19h 31m 43s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 17s. [2025-11-13 09:30:28,619][__main__][INFO] - Starting iteration 239. [2025-11-13 09:30:28,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:30:28,622][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:30:37,530][__main__][INFO] - Number of regex retries in iteration 239: 0 [2025-11-13 09:30:37,531][__main__][INFO] - agents played in iteration 239 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:30:38,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:38,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:38,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:38,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:38,122][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:30:38,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:30:38,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:30:39,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:30:39,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:30:39,849][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:30:40,173][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:30:40,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:30:40,826][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:30:41,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:30:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:30:41,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:30:42,135][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:30:42,459][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:30:42,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:30:43,113][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:30:43,438][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:30:43,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:30:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:30:44,413][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:30:44,741][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:30:45,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:30:45,390][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:30:45,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:30:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:30:46,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:30:46,693][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:30:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:30:47,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:30:47,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:30:48,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:30:48,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:30:48,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:30:48,987][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:30:49,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:30:50,066][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:30:50,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:30:50,805][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:30:50,807][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:30:51,808][__main__][INFO] - Iteration 240 took 23s (38.42% Gen, 57.26% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 52m 17s. Estimated total time: 19h 19m 20s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 13s. [2025-11-13 09:30:51,810][__main__][INFO] - Starting iteration 240. [2025-11-13 09:30:51,813][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. [2025-11-13 09:30:51,814][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:31:01,241][__main__][INFO] - Number of regex retries in iteration 240: 0 [2025-11-13 09:31:01,242][__main__][INFO] - agents played in iteration 240 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:31:01,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:01,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:01,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:01,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:01,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:31:01,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:31:02,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:31:02,891][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:31:03,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:31:03,537][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:31:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:31:04,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:31:04,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:31:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:31:05,160][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:31:05,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:31:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:31:06,135][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:31:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:31:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:31:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:31:07,434][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:31:07,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:31:08,091][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:31:08,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:31:08,734][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:31:09,059][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:31:09,390][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:31:09,707][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:31:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:31:10,355][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:31:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:31:11,004][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:31:11,327][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:31:11,650][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:31:11,976][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:31:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:31:12,623][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:31:12,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:31:13,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:31:14,442][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:31:14,444][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:31:14,446][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:31:16,394][__main__][INFO] - Iteration 241 took 24s (38.35% Gen, 53.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 1m 36s. Estimated total time: 20h 29m 3s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 58s, 500 more iterations: 3h 24m 50s. [2025-11-13 09:31:16,396][__main__][INFO] - Starting iteration 241. [2025-11-13 09:31:16,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:31:16,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:31:26,088][__main__][INFO] - Number of regex retries in iteration 241: 0 [2025-11-13 09:31:26,089][__main__][INFO] - agents played in iteration 241 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:31:26,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:26,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:26,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:26,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:26,681][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:31:26,681][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:31:27,452][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:31:27,750][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:31:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:31:28,403][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:31:28,732][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:31:29,067][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:31:29,391][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:31:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:31:30,041][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:31:30,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:31:30,704][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:31:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:31:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:31:31,682][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:31:32,005][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:31:32,330][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:31:32,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:31:32,979][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:31:33,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:31:33,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:31:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:31:34,290][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:31:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:31:34,937][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:31:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:31:35,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:31:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:31:36,242][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:31:36,566][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:31:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:31:37,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:31:37,549][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:31:37,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:31:38,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:31:39,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:31:39,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:31:39,420][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:31:40,425][__main__][INFO] - Iteration 242 took 24s (40.33% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 33m 30s. Estimated total time: 20h 1m 22s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 2s, 500 more iterations: 3h 20m 13s. [2025-11-13 09:31:40,428][__main__][INFO] - Starting iteration 242. [2025-11-13 09:31:40,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:31:40,432][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:31:50,258][__main__][INFO] - Number of regex retries in iteration 242: 0 [2025-11-13 09:31:50,259][__main__][INFO] - agents played in iteration 242 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:31:50,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:50,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:50,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:50,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:50,838][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:31:50,839][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:31:51,608][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:31:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:31:52,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:31:52,555][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:31:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:31:53,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:31:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:31:53,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:31:54,178][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:31:54,503][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:31:54,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:31:55,151][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:31:55,477][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:31:55,804][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:31:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:31:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:31:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:31:57,102][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:31:57,426][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:31:57,751][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:31:58,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:31:58,400][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:31:58,724][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:31:59,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:31:59,371][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:31:59,695][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:00,018][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:00,344][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:00,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:01,327][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:01,651][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:01,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:32:02,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:03,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:03,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:03,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:04,523][__main__][INFO] - Iteration 243 took 24s (40.79% Gen, 55.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 36m 20s. Estimated total time: 20h 4m 36s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 46s. [2025-11-13 09:32:04,525][__main__][INFO] - Starting iteration 243. [2025-11-13 09:32:04,529][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:32:04,529][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:14,561][__main__][INFO] - Number of regex retries in iteration 243: 0 [2025-11-13 09:32:14,562][__main__][INFO] - agents played in iteration 243 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:32:15,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:15,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:32:15,924][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:32:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:32:16,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:32:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:32:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:32:17,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:32:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:32:18,179][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:32:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:32:18,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:32:19,155][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:32:19,488][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:32:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:32:20,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:32:20,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:32:20,802][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:32:21,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:32:21,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:32:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:32:22,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:32:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:32:22,762][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:32:23,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:32:23,409][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:32:23,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:32:24,059][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:24,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:24,711][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:25,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:26,013][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:26,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:32:27,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:27,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:27,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:27,841][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:28,859][__main__][INFO] - Iteration 244 took 24s (41.23% Gen, 54.58% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 47m 53s. Estimated total time: 20h 16m 33s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 33s, 500 more iterations: 3h 22m 45s. [2025-11-13 09:32:28,861][__main__][INFO] - Starting iteration 244. [2025-11-13 09:32:28,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:32:28,865][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:38,463][__main__][INFO] - Number of regex retries in iteration 244: 0 [2025-11-13 09:32:38,464][__main__][INFO] - agents played in iteration 244 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:32:38,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:38,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:39,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:39,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:39,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:39,068][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:32:39,868][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:32:40,164][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:32:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:32:40,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:32:41,138][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:32:41,463][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:32:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:32:42,113][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:32:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:32:42,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:32:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:32:43,425][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:32:43,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:32:44,072][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:32:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:32:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:32:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:32:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:32:45,694][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:32:46,022][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:32:46,349][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:32:46,673][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:32:46,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:32:47,327][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:32:47,653][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:32:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:48,305][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:48,626][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:48,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:49,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:49,608][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:49,928][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:50,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:32:51,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:51,822][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:51,824][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:51,826][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:52,806][__main__][INFO] - Iteration 245 took 23s (40.09% Gen, 55.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 2s. Estimated total time: 19h 57m 6s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 31s. [2025-11-13 09:32:52,808][__main__][INFO] - Starting iteration 245. [2025-11-13 09:32:52,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:32:52,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:33:02,826][__main__][INFO] - Number of regex retries in iteration 245: 0 [2025-11-13 09:33:02,827][__main__][INFO] - agents played in iteration 245 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:33:03,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:03,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:03,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:03,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:03,405][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:33:03,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:33:04,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:33:04,455][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:33:04,780][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:33:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:33:05,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:33:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:33:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:33:06,408][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:33:06,733][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:33:07,059][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:33:07,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:33:07,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:33:08,043][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:33:08,371][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:33:08,697][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:33:09,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:33:09,353][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:33:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:33:10,008][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:33:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:33:10,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:33:10,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:33:11,305][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:33:11,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:33:11,957][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:33:12,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:33:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:33:12,929][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:33:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:33:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:33:13,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:33:14,229][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:33:14,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:33:15,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:33:16,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:33:16,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:33:16,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:33:17,028][__main__][INFO] - Iteration 246 took 24s (41.35% Gen, 54.68% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 41m 23s. Estimated total time: 20h 10m 51s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 21s, 500 more iterations: 3h 21m 48s. [2025-11-13 09:33:17,030][__main__][INFO] - Starting iteration 246. [2025-11-13 09:33:17,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:33:17,033][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:33:26,783][__main__][INFO] - Number of regex retries in iteration 246: 0 [2025-11-13 09:33:26,784][__main__][INFO] - agents played in iteration 246 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:33:27,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:27,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:27,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:27,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:27,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:33:27,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:33:28,089][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:33:28,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:33:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:33:29,033][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:33:29,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:33:29,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:33:30,008][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:33:30,333][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:33:30,659][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:33:30,983][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:33:31,306][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:33:31,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:33:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:33:32,287][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:33:32,616][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:33:32,940][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:33:33,263][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:33:33,587][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:33:33,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:33:34,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:33:34,561][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:33:34,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:33:35,212][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:33:35,541][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:33:35,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:33:36,192][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:33:36,518][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:33:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:33:37,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:33:37,490][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:33:37,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:33:38,142][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:33:38,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:33:39,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:33:39,978][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:33:39,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:33:39,982][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:33:40,891][__main__][INFO] - Iteration 247 took 23s (40.87% Gen, 55.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 23m 7s. Estimated total time: 19h 52m 59s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 45s, 500 more iterations: 3h 18m 49s. [2025-11-13 09:33:40,893][__main__][INFO] - Starting iteration 247. [2025-11-13 09:33:40,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:33:40,896][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:33:50,612][__main__][INFO] - Number of regex retries in iteration 247: 0 [2025-11-13 09:33:50,613][__main__][INFO] - agents played in iteration 247 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:33:51,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:51,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:51,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:51,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:51,195][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:33:51,195][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:33:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:33:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:33:52,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:33:52,878][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:33:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:33:53,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:33:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:33:54,185][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:33:54,511][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:33:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:33:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:33:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:33:55,818][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:33:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:33:56,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:33:56,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:33:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:33:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:33:57,761][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:33:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:33:58,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:33:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:33:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:33:59,382][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:33:59,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:34:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:34:00,356][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:34:00,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:34:01,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:34:01,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:34:01,656][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:34:01,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:34:02,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:34:03,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:34:03,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:34:03,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:34:03,835][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:34:04,759][__main__][INFO] - Iteration 248 took 23s (40.71% Gen, 55.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 56s. Estimated total time: 19h 53m 12s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 52s. [2025-11-13 09:34:04,761][__main__][INFO] - Starting iteration 248. [2025-11-13 09:34:04,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:34:04,764][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:34:14,435][__main__][INFO] - Number of regex retries in iteration 248: 0 [2025-11-13 09:34:14,435][__main__][INFO] - agents played in iteration 248 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:34:14,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:14,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:14,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:15,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:15,003][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:34:15,003][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:34:15,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:34:16,034][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:34:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:34:16,688][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:34:17,018][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:34:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:34:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:34:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:34:18,321][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:34:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:34:18,972][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:34:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:34:19,630][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:34:19,957][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:34:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:34:20,606][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:34:20,932][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:34:21,257][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:34:21,583][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:34:21,915][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:34:22,236][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:34:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:34:22,895][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:34:23,225][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:34:23,545][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:34:23,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:34:24,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:34:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:34:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:34:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:34:25,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:34:25,835][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:34:26,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:34:26,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:34:27,685][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:34:27,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:34:27,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:34:28,700][__main__][INFO] - Iteration 249 took 23s (40.40% Gen, 55.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 11s. Estimated total time: 19h 56m 51s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 53s, 500 more iterations: 3h 19m 28s. [2025-11-13 09:34:28,702][__main__][INFO] - Starting iteration 249. [2025-11-13 09:34:28,704][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:34:28,705][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:34:38,643][__main__][INFO] - Number of regex retries in iteration 249: 0 [2025-11-13 09:34:38,644][__main__][INFO] - agents played in iteration 249 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:34:39,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:39,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:39,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:39,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:34:39,242][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:34:39,242][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:34:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:34:40,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:34:40,605][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:34:40,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:34:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:34:41,582][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:34:41,917][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:34:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:34:42,566][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:34:42,890][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:34:43,227][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:34:43,553][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:34:43,877][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:34:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:34:44,533][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:34:44,857][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:34:45,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:34:45,504][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:34:45,827][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:34:46,150][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:34:46,474][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:34:46,798][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:34:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:34:47,446][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:34:47,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:34:48,094][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:34:48,421][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:34:48,744][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:34:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:34:49,391][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:34:49,718][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:34:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:34:50,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:34:51,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:34:51,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:34:51,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:34:51,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:34:52,886][__main__][INFO] - Iteration 250 took 24s (41.10% Gen, 54.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 4s. Estimated total time: 20h 9m 8s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 18s, 500 more iterations: 3h 21m 31s. [2025-11-13 09:34:52,888][__main__][INFO] - Starting iteration 250. [2025-11-13 09:34:52,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. [2025-11-13 09:34:52,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:35:03,028][__main__][INFO] - Number of regex retries in iteration 250: 0 [2025-11-13 09:35:03,029][__main__][INFO] - agents played in iteration 250 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:35:03,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:03,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:03,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:03,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:03,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:35:03,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:35:04,356][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:35:04,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:35:04,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:35:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:35:05,634][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:35:05,954][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:35:06,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:35:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:35:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:35:07,256][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:35:07,583][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:35:07,909][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:35:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:35:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:35:08,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:35:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:35:09,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:35:09,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:35:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:35:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:35:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:35:11,151][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:35:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:35:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:35:12,123][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:35:12,448][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:35:12,776][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:35:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:35:13,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:35:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:35:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:35:14,405][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:35:14,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:35:15,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:35:16,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:35:16,208][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:35:16,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:35:18,156][__main__][INFO] - Iteration 251 took 25s (40.12% Gen, 52.17% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 31m 46s. Estimated total time: 21h 3m 15s. Time estimates for 10 more iterations: 4m 12s, 100 more iterations: 42m 6s, 500 more iterations: 3h 30m 32s. [2025-11-13 09:35:18,158][__main__][INFO] - Starting iteration 251. [2025-11-13 09:35:18,160][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:35:18,160][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:35:27,476][__main__][INFO] - Number of regex retries in iteration 251: 0 [2025-11-13 09:35:27,477][__main__][INFO] - agents played in iteration 251 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:35:27,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:27,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:28,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:28,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:28,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:35:28,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:35:28,808][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:35:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:35:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:35:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:35:30,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:35:30,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:35:30,737][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:35:31,062][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:35:31,388][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:35:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:35:32,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:35:32,368][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:35:32,693][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:35:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:35:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:35:33,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:35:33,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:35:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:35:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:35:34,970][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:35:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:35:35,622][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:35:35,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:35:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:35:36,597][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:35:36,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:35:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:35:37,568][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:35:37,891][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:35:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:35:38,544][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:35:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:35:39,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:35:39,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:35:40,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:35:40,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:35:40,644][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:35:41,653][__main__][INFO] - Iteration 252 took 23s (39.66% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 49s. Estimated total time: 19h 34m 41s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 46s. [2025-11-13 09:35:41,655][__main__][INFO] - Starting iteration 252. [2025-11-13 09:35:41,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:35:41,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:35:51,957][__main__][INFO] - Number of regex retries in iteration 252: 0 [2025-11-13 09:35:51,958][__main__][INFO] - agents played in iteration 252 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:35:52,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:52,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:52,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:52,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:52,521][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:35:52,521][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:35:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:35:53,600][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:35:53,926][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:35:54,250][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:35:54,576][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:35:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:35:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:35:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:35:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:35:56,207][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:35:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:35:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:35:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:35:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:35:57,846][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:35:58,172][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:35:58,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:35:58,834][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:35:59,159][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:35:59,484][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:35:59,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:36:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:00,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:01,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:01,439][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:01,766][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:02,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:02,749][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:03,398][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:03,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:36:04,447][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:05,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:05,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:05,226][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:06,243][__main__][INFO] - Iteration 253 took 24s (41.89% Gen, 53.97% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 56m 59s. Estimated total time: 20h 29m 16s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 58s, 500 more iterations: 3h 24m 52s. [2025-11-13 09:36:06,245][__main__][INFO] - Starting iteration 253. [2025-11-13 09:36:06,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:36:06,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:16,544][__main__][INFO] - Number of regex retries in iteration 253: 0 [2025-11-13 09:36:16,544][__main__][INFO] - agents played in iteration 253 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:36:17,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:17,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:17,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:17,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:17,128][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:17,128][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:36:17,890][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:18,186][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:18,512][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:18,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:19,161][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:19,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:19,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:20,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:21,118][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:21,776][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:22,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:23,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:23,401][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:24,058][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:24,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:36:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:25,043][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:25,366][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:25,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:26,013][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:26,668][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:27,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:27,650][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:27,976][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:28,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:36:29,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:29,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:29,790][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:29,792][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:30,823][__main__][INFO] - Iteration 254 took 24s (41.89% Gen, 53.91% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 56m 5s. Estimated total time: 20h 28m 46s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 57s, 500 more iterations: 3h 24m 47s. [2025-11-13 09:36:30,825][__main__][INFO] - Starting iteration 254. [2025-11-13 09:36:30,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:36:30,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:41,089][__main__][INFO] - Number of regex retries in iteration 254: 0 [2025-11-13 09:36:41,089][__main__][INFO] - agents played in iteration 254 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:36:41,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:41,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:41,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:41,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:41,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:41,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:36:42,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:42,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:43,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:44,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:45,019][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:45,343][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:45,667][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:46,316][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:46,963][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:47,290][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:47,612][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:48,261][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:36:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:49,879][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:50,204][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:50,526][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:50,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:51,501][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:51,826][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:52,152][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:52,475][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:52,802][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:36:53,528][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:54,286][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:54,287][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:54,289][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:55,303][__main__][INFO] - Iteration 255 took 24s (41.92% Gen, 53.93% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 50m 38s. Estimated total time: 20h 23m 45s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 47s, 500 more iterations: 3h 23m 57s. [2025-11-13 09:36:55,305][__main__][INFO] - Starting iteration 255. [2025-11-13 09:36:55,308][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:36:55,309][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:37:05,484][__main__][INFO] - Number of regex retries in iteration 255: 0 [2025-11-13 09:37:05,485][__main__][INFO] - agents played in iteration 255 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:37:05,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:05,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:06,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:06,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:06,062][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:37:06,062][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:37:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:37:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:37:07,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:37:07,770][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:37:08,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:37:08,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:37:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:37:09,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:37:09,394][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:37:09,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:37:10,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:37:10,363][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:37:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:37:11,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:37:11,334][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:37:11,657][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:37:11,990][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:37:12,305][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:37:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:37:12,951][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:37:13,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:37:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:37:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:37:14,245][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:37:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:37:14,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:37:15,218][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:37:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:37:15,867][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:37:16,190][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:37:16,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:37:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:37:17,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:37:17,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:18,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:18,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:18,658][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:19,666][__main__][INFO] - Iteration 256 took 24s (41.78% Gen, 54.08% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 44m 25s. Estimated total time: 20h 17m 55s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 35s, 500 more iterations: 3h 22m 59s. [2025-11-13 09:37:19,668][__main__][INFO] - Starting iteration 256. [2025-11-13 09:37:19,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:37:19,672][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:37:29,876][__main__][INFO] - Number of regex retries in iteration 256: 0 [2025-11-13 09:37:29,877][__main__][INFO] - agents played in iteration 256 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:37:30,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:30,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:30,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:30,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:30,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:37:30,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:37:31,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:37:31,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:37:31,886][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:37:32,215][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:37:32,540][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:37:32,863][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:37:33,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:37:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:37:33,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:37:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:37:34,490][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:37:34,816][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:37:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:37:35,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:37:35,802][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:37:36,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:37:36,450][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:37:36,776][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:37:37,105][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:37:37,428][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:37:37,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:37:38,078][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:37:38,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:37:38,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:37:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:37:39,377][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:37:39,705][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:37:40,029][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:37:40,352][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:37:40,676][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:37:40,999][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:37:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:37:41,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:37:42,392][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:43,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:43,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:43,167][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:44,173][__main__][INFO] - Iteration 257 took 24s (41.65% Gen, 54.24% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 51m 12s. Estimated total time: 20h 25m 7s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 50s, 500 more iterations: 3h 24m 11s. [2025-11-13 09:37:44,175][__main__][INFO] - Starting iteration 257. [2025-11-13 09:37:44,178][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:37:44,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:37:54,494][__main__][INFO] - Number of regex retries in iteration 257: 0 [2025-11-13 09:37:54,495][__main__][INFO] - agents played in iteration 257 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:37:54,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:54,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:55,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:55,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:55,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:37:55,065][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:37:55,860][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:37:56,156][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:37:56,485][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:37:56,809][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:37:57,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:37:57,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:37:57,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:37:58,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:37:58,437][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:37:58,761][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:37:59,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:37:59,414][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:37:59,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:38:00,059][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:38:00,383][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:38:00,711][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:38:01,036][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:38:01,360][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:38:01,683][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:38:02,010][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:38:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:38:02,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:38:02,979][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:38:03,306][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:38:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:38:03,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:38:04,272][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:38:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:38:04,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:38:05,243][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:38:05,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:38:05,898][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:38:06,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:38:06,949][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:38:07,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:38:07,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:38:07,734][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:38:08,771][__main__][INFO] - Iteration 258 took 24s (41.95% Gen, 53.83% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 55m 21s. Estimated total time: 20h 29m 41s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 59s, 500 more iterations: 3h 24m 56s. [2025-11-13 09:38:08,777][__main__][INFO] - Starting iteration 258. [2025-11-13 09:38:08,780][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:38:08,781][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:38:19,240][__main__][INFO] - Number of regex retries in iteration 258: 0 [2025-11-13 09:38:19,241][__main__][INFO] - agents played in iteration 258 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:38:19,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:19,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:19,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:19,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:19,826][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:38:19,826][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:38:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:38:20,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:38:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:38:21,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:38:21,890][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:38:22,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:38:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:38:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:38:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:38:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:38:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:38:24,162][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:38:24,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:38:24,818][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:38:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:38:25,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:38:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:38:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:38:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:38:26,764][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:38:27,089][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:38:27,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:38:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:38:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:38:28,393][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:38:28,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:38:29,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:38:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:38:29,698][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:38:30,028][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:38:30,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:38:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:38:31,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:38:31,697][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:38:32,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:38:32,464][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:38:32,466][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:38:33,467][__main__][INFO] - Iteration 259 took 24s (42.37% Gen, 53.57% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 59m 40s. Estimated total time: 20h 34m 24s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 8s, 500 more iterations: 3h 25m 44s. [2025-11-13 09:38:33,469][__main__][INFO] - Starting iteration 259. [2025-11-13 09:38:33,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:38:33,473][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:38:44,036][__main__][INFO] - Number of regex retries in iteration 259: 0 [2025-11-13 09:38:44,037][__main__][INFO] - agents played in iteration 259 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:38:44,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:44,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:44,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:44,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:38:44,610][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:38:44,611][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:38:45,400][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:38:45,698][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:38:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:38:46,351][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:38:46,675][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:38:47,002][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:38:47,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:38:47,650][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:38:47,977][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:38:48,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:38:48,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:38:48,948][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:38:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:38:49,599][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:38:49,925][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:38:50,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:38:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:38:50,900][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:38:51,224][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:38:51,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:38:51,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:38:52,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:38:52,526][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:38:52,855][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:38:53,186][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:38:53,506][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:38:53,830][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:38:54,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:38:54,484][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:38:54,806][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:38:55,130][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:38:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:38:55,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:38:56,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:38:57,320][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:38:57,321][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:38:57,323][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:38:58,343][__main__][INFO] - Iteration 260 took 24s (42.47% Gen, 53.42% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 8m 22s. Estimated total time: 20h 43m 31s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 27s, 500 more iterations: 3h 27m 15s. [2025-11-13 09:38:58,345][__main__][INFO] - Starting iteration 260. [2025-11-13 09:38:58,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 09:38:58,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:39:08,618][__main__][INFO] - Number of regex retries in iteration 260: 0 [2025-11-13 09:39:08,618][__main__][INFO] - agents played in iteration 260 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:39:09,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:09,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:09,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:09,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:09,197][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:39:09,198][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:39:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:39:10,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:39:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:39:10,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:39:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:39:11,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:39:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:39:12,225][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:39:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:39:12,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:39:13,198][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:39:13,522][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:39:13,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:39:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:39:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:39:14,821][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:39:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:39:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:39:15,796][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:39:16,122][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:39:16,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:39:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:39:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:39:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:39:17,751][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:39:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:39:18,404][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:39:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:39:19,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:39:19,374][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:39:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:39:20,023][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:39:20,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:39:21,072][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:39:21,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:39:21,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:39:21,829][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:39:23,921][__main__][INFO] - Iteration 261 took 25s (40.16% Gen, 51.66% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 43m 6s. Estimated total time: 21h 18m 41s. Time estimates for 10 more iterations: 4m 15s, 100 more iterations: 42m 37s, 500 more iterations: 3h 33m 6s. [2025-11-13 09:39:23,923][__main__][INFO] - Starting iteration 261. [2025-11-13 09:39:23,926][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:39:23,926][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:39:34,428][__main__][INFO] - Number of regex retries in iteration 261: 0 [2025-11-13 09:39:34,429][__main__][INFO] - agents played in iteration 261 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:39:34,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:34,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:34,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:34,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:34,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:39:34,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:39:35,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:39:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:39:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:39:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:39:37,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:39:37,457][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:39:37,781][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:39:38,108][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:39:38,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:39:38,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:39:39,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:39:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:39:39,734][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:39:40,059][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:39:40,383][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:39:40,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:39:41,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:39:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:39:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:39:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:39:42,336][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:39:42,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:39:42,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:39:43,312][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:39:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:39:43,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:39:44,289][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:39:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:39:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:39:45,265][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:39:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:39:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:39:46,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:39:47,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:39:47,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:39:47,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:39:47,830][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:39:48,832][__main__][INFO] - Iteration 262 took 24s (42.17% Gen, 53.81% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 9m 20s. Estimated total time: 20h 45m 20s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 30s, 500 more iterations: 3h 27m 33s. [2025-11-13 09:39:48,834][__main__][INFO] - Starting iteration 262. [2025-11-13 09:39:48,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:39:48,838][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:39:59,653][__main__][INFO] - Number of regex retries in iteration 262: 0 [2025-11-13 09:39:59,654][__main__][INFO] - agents played in iteration 262 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:40:00,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:00,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:00,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:00,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:00,255][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:00,256][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:40:00,999][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:01,624][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:01,943][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:02,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:02,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:03,240][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:03,564][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:03,888][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:04,212][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:04,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:05,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:05,511][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:05,835][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:06,159][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:06,482][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:06,806][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:07,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:40:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:07,777][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:08,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:08,424][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:08,747][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:09,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:09,719][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:10,042][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:10,366][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:11,014][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:11,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:40:12,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:12,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:12,841][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:12,843][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:13,832][__main__][INFO] - Iteration 263 took 24s (43.27% Gen, 52.76% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 13m 22s. Estimated total time: 20h 49m 47s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 39s, 500 more iterations: 3h 28m 17s. [2025-11-13 09:40:13,834][__main__][INFO] - Starting iteration 263. [2025-11-13 09:40:13,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:40:13,838][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:24,913][__main__][INFO] - Number of regex retries in iteration 263: 0 [2025-11-13 09:40:24,914][__main__][INFO] - agents played in iteration 263 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:40:25,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:25,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:25,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:25,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:25,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:25,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:40:26,222][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:26,525][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:26,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:27,164][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:27,488][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:27,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:28,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:28,458][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:29,432][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:29,757][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:30,082][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:30,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:30,730][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:31,053][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:31,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:31,702][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:40:32,675][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:33,322][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:33,646][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:33,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:34,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:35,588][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:36,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:40:37,327][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:38,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:38,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:38,073][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:38,978][__main__][INFO] - Iteration 264 took 25s (44.05% Gen, 52.34% Train). Generation: 11s, Training: 13s. Estimated remaining time: 19h 20m 15s. Estimated total time: 20h 57m 5s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 54s, 500 more iterations: 3h 29m 30s. [2025-11-13 09:40:38,980][__main__][INFO] - Starting iteration 264. [2025-11-13 09:40:38,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:40:38,983][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:49,717][__main__][INFO] - Number of regex retries in iteration 264: 0 [2025-11-13 09:40:49,717][__main__][INFO] - agents played in iteration 264 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:40:50,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:50,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:50,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:50,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:50,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:50,292][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:40:50,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:51,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:52,259][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:52,582][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:52,909][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:53,558][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:53,885][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:54,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:54,532][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:54,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:55,181][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:55,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:55,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:56,155][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:56,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:57,127][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:40:57,451][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:58,102][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:58,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:58,749][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:59,073][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:59,401][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:59,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:00,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:00,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:00,701][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:01,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:41:02,126][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:02,863][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:02,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:02,866][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:03,795][__main__][INFO] - Iteration 265 took 24s (43.26% Gen, 52.99% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 3m 26s. Estimated total time: 20h 40m 40s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 21s, 500 more iterations: 3h 26m 46s. [2025-11-13 09:41:03,798][__main__][INFO] - Starting iteration 265. [2025-11-13 09:41:03,800][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:41:03,801][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:14,197][__main__][INFO] - Number of regex retries in iteration 265: 0 [2025-11-13 09:41:14,198][__main__][INFO] - agents played in iteration 265 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:41:14,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:14,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:14,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:14,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:14,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:14,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:41:15,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:15,831][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:16,809][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:17,456][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:17,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:18,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:18,429][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:19,406][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:20,055][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:20,380][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:20,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:21,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:21,357][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:21,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:41:22,015][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:41:22,338][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:22,663][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:23,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:23,642][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:23,966][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:24,296][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:24,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:24,944][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:25,269][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:25,596][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:25,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:41:26,689][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:27,453][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:27,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:27,456][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:28,436][__main__][INFO] - Iteration 266 took 24s (42.20% Gen, 53.81% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 54m 10s. Estimated total time: 20h 31m 49s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 3s, 500 more iterations: 3h 25m 18s. [2025-11-13 09:41:28,438][__main__][INFO] - Starting iteration 266. [2025-11-13 09:41:28,440][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:41:28,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:39,029][__main__][INFO] - Number of regex retries in iteration 266: 0 [2025-11-13 09:41:39,030][__main__][INFO] - agents played in iteration 266 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:41:39,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:39,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:39,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:39,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:39,607][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:39,608][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:41:40,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:40,647][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:40,972][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:41,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:41,624][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:41,950][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:42,277][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:42,602][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:42,926][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:43,252][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:43,576][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:43,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:44,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:44,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:44,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:45,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:45,842][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:46,491][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:41:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:41:47,150][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:47,797][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:48,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:48,780][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:49,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:49,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:50,398][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:50,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:41:51,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:52,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:52,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:52,257][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:53,266][__main__][INFO] - Iteration 267 took 24s (42.65% Gen, 53.28% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 3m 15s. Estimated total time: 20h 41m 19s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 22s, 500 more iterations: 3h 26m 53s. [2025-11-13 09:41:53,268][__main__][INFO] - Starting iteration 267. [2025-11-13 09:41:53,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:41:53,272][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:04,079][__main__][INFO] - Number of regex retries in iteration 267: 0 [2025-11-13 09:42:04,080][__main__][INFO] - agents played in iteration 267 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:42:04,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:04,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:04,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:04,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:04,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:04,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:42:05,440][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:42:05,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:42:06,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:42:06,384][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:42:06,708][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:42:07,031][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:42:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:42:07,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:42:08,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:42:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:42:08,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:42:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:42:09,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:42:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:42:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:42:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:42:10,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:42:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:42:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:42:11,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:42:11,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:42:12,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:42:12,552][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:42:12,877][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:42:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:42:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:42:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:42:14,171][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:42:14,496][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:42:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:42:15,153][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:42:15,470][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:42:15,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:42:16,558][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:42:17,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:42:17,300][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:42:17,302][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:42:18,255][__main__][INFO] - Iteration 268 took 24s (43.26% Gen, 52.92% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 10m 43s. Estimated total time: 20h 49m 13s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 38s, 500 more iterations: 3h 28m 12s. [2025-11-13 09:42:18,257][__main__][INFO] - Starting iteration 268. [2025-11-13 09:42:18,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:42:18,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:28,862][__main__][INFO] - Number of regex retries in iteration 268: 0 [2025-11-13 09:42:28,863][__main__][INFO] - agents played in iteration 268 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:42:29,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:29,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:29,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:29,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:29,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:29,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:42:30,207][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:42:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:42:30,827][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:42:31,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:42:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:42:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:42:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:42:32,448][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:42:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:42:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:42:33,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:42:33,748][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:42:34,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:42:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:42:34,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:42:35,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:42:35,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:42:35,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:42:36,019][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:42:36,346][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:42:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:42:36,992][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:42:37,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:42:37,642][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:42:37,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:42:38,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:42:38,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:42:38,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:42:39,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:42:39,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:42:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:42:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:42:40,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:42:41,347][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:42:42,104][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:42:42,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:42:42,108][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:42:43,065][__main__][INFO] - Iteration 269 took 24s (42.74% Gen, 53.39% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 1m 25s. Estimated total time: 20h 40m 19s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 20s, 500 more iterations: 3h 26m 43s. [2025-11-13 09:42:43,067][__main__][INFO] - Starting iteration 269. [2025-11-13 09:42:43,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:42:43,070][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:53,460][__main__][INFO] - Number of regex retries in iteration 269: 0 [2025-11-13 09:42:53,460][__main__][INFO] - agents played in iteration 269 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:42:53,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:53,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:54,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:54,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:54,050][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:54,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:42:54,806][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:42:55,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:42:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:42:55,748][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:42:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:42:56,397][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:42:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:42:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:42:57,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:42:57,696][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:42:58,020][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:42:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:42:58,670][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:42:58,992][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:42:59,318][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:42:59,641][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:42:59,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:43:00,290][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:43:00,613][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:43:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:43:01,264][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:43:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:43:01,919][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:43:02,245][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:43:02,569][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:43:02,895][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:43:03,221][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:43:03,548][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:43:03,872][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:43:04,195][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:43:04,520][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:43:04,844][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:43:05,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:43:05,971][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:43:06,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:43:06,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:43:06,774][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:43:07,736][__main__][INFO] - Iteration 270 took 24s (42.12% Gen, 53.98% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 54m 1s. Estimated total time: 20h 33m 20s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 6s, 500 more iterations: 3h 25m 33s. [2025-11-13 09:43:07,738][__main__][INFO] - Starting iteration 270. [2025-11-13 09:43:07,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:43:07,743][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:43:17,821][__main__][INFO] - Number of regex retries in iteration 270: 0 [2025-11-13 09:43:17,821][__main__][INFO] - agents played in iteration 270 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:43:18,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:18,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:18,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:18,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:18,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:43:18,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:43:19,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:43:19,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:43:19,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:43:20,148][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:43:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:43:20,796][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:43:21,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:43:21,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:43:21,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:43:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:43:22,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:43:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:43:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:43:23,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:43:23,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:43:24,040][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:43:24,363][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:43:24,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:43:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:43:25,333][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:43:25,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:43:25,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:43:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:43:26,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:43:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:43:27,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:43:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:43:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:43:28,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:43:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:43:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:43:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:43:29,564][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:43:30,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:43:31,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:43:31,062][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:43:31,063][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:43:32,971][__main__][INFO] - Iteration 271 took 25s (39.95% Gen, 52.48% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 21m 46s. Estimated total time: 21h 1m 30s. Time estimates for 10 more iterations: 4m 12s, 100 more iterations: 42m 3s, 500 more iterations: 3h 30m 15s. [2025-11-13 09:43:32,973][__main__][INFO] - Starting iteration 271. [2025-11-13 09:43:32,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:43:32,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:43:43,500][__main__][INFO] - Number of regex retries in iteration 271: 0 [2025-11-13 09:43:43,501][__main__][INFO] - agents played in iteration 271 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:43:43,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:44,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:43:44,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:43:44,880][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:43:45,176][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:43:45,501][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:43:45,825][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:43:46,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:43:46,479][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:43:46,798][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:43:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:43:47,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:43:47,779][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:43:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:43:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:43:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:43:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:43:49,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:43:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:43:50,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:43:50,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:43:50,689][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:43:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:43:51,339][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:43:51,664][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:43:51,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:43:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:43:52,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:43:52,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:43:53,288][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:43:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:43:53,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:43:54,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:43:54,589][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:43:54,913][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:43:55,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:43:56,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:43:56,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:43:56,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:43:56,784][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:43:57,839][__main__][INFO] - Iteration 272 took 24s (42.33% Gen, 53.42% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 3m 3s. Estimated total time: 20h 43m 12s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 26s, 500 more iterations: 3h 27m 12s. [2025-11-13 09:43:57,841][__main__][INFO] - Starting iteration 272. [2025-11-13 09:43:57,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:43:57,845][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:08,526][__main__][INFO] - Number of regex retries in iteration 272: 0 [2025-11-13 09:44:08,527][__main__][INFO] - agents played in iteration 272 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:44:09,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:09,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:09,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:09,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:09,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:09,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:44:09,901][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:44:10,198][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:44:10,523][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:44:10,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:44:11,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:44:11,503][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:44:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:44:12,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:44:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:44:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:44:13,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:44:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:44:13,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:44:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:44:14,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:44:14,774][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:44:15,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:44:15,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:44:15,749][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:44:16,073][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:44:16,397][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:44:16,723][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:44:17,050][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:44:17,374][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:44:17,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:44:18,022][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:44:18,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:44:18,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:44:18,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:44:19,321][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:44:19,649][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:44:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:44:20,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:44:21,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:44:21,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:44:21,798][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:44:21,800][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:44:22,767][__main__][INFO] - Iteration 273 took 24s (42.86% Gen, 53.26% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 5m 34s. Estimated total time: 20h 46m 8s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 32s, 500 more iterations: 3h 27m 41s. [2025-11-13 09:44:22,769][__main__][INFO] - Starting iteration 273. [2025-11-13 09:44:22,772][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:44:22,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:33,912][__main__][INFO] - Number of regex retries in iteration 273: 0 [2025-11-13 09:44:33,913][__main__][INFO] - agents played in iteration 273 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:44:34,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:34,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:34,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:34,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:34,495][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:34,495][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:44:35,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:44:35,560][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:44:35,886][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:44:36,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:44:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:44:36,865][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:44:37,189][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:44:37,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:44:37,842][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:44:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:44:38,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:44:38,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:44:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:44:39,473][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:44:39,802][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:44:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:44:40,453][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:44:40,780][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:44:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:44:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:44:41,758][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:44:42,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:44:42,411][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:44:42,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:44:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:44:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:44:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:44:44,039][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:44:44,364][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:44:44,689][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:44:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:44:45,346][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:44:45,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:44:46,455][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:44:47,203][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:44:47,205][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:44:47,206][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:44:48,149][__main__][INFO] - Iteration 274 took 25s (43.90% Gen, 52.38% Train). Generation: 11s, Training: 13s. Estimated remaining time: 19h 27m 53s. Estimated total time: 21h 8m 53s. Time estimates for 10 more iterations: 4m 13s, 100 more iterations: 42m 17s, 500 more iterations: 3h 31m 28s. [2025-11-13 09:44:48,151][__main__][INFO] - Starting iteration 274. [2025-11-13 09:44:48,154][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:44:48,154][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:58,725][__main__][INFO] - Number of regex retries in iteration 274: 0 [2025-11-13 09:44:58,725][__main__][INFO] - agents played in iteration 274 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:44:59,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:59,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:59,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:59,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:59,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:59,318][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:45:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:00,375][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:00,700][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:01,023][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:01,351][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:01,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:02,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:02,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:02,988][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:03,631][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:03,958][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:04,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:04,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:05,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:05,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:06,555][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:45:06,885][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:07,205][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:07,856][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:08,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:08,828][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:09,477][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:09,802][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:10,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:45:10,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:45:11,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:45:11,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:11,976][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:11,978][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:12,989][__main__][INFO] - Iteration 275 took 24s (42.56% Gen, 53.36% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 0m 25s. Estimated total time: 20h 41m 49s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 23s, 500 more iterations: 3h 26m 58s. [2025-11-13 09:45:12,991][__main__][INFO] - Starting iteration 275. [2025-11-13 09:45:12,994][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:45:12,994][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:23,420][__main__][INFO] - Number of regex retries in iteration 275: 0 [2025-11-13 09:45:23,421][__main__][INFO] - agents played in iteration 275 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:45:23,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:23,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:23,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:24,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:24,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:24,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:45:24,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:25,061][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:25,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:25,713][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:26,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:26,360][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:26,689][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:27,008][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:27,332][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:27,657][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:28,305][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:28,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:28,953][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:29,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:29,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:30,249][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:30,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:30,899][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:31,224][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:45:31,548][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:31,879][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:32,197][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:32,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:33,175][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:33,493][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:34,466][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:34,790][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:45:35,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:45:35,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:45:36,712][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:36,713][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:36,715][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:37,689][__main__][INFO] - Iteration 276 took 24s (42.22% Gen, 53.83% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 53m 0s. Estimated total time: 20h 34m 49s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 9s, 500 more iterations: 3h 25m 48s. [2025-11-13 09:45:37,691][__main__][INFO] - Starting iteration 276. [2025-11-13 09:45:37,694][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:45:37,694][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:48,392][__main__][INFO] - Number of regex retries in iteration 276: 0 [2025-11-13 09:45:48,393][__main__][INFO] - agents played in iteration 276 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:45:48,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:48,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:48,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:48,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:48,981][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:48,982][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:45:49,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:50,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:50,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:50,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:51,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:51,689][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:52,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:52,661][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:52,986][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:53,310][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:53,635][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:53,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:54,616][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:54,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:55,264][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:55,918][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:45:56,568][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:57,218][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:58,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:58,525][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:58,848][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:59,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:59,496][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:59,820][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:46:00,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:46:00,926][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:46:01,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:46:01,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:46:01,674][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:46:02,641][__main__][INFO] - Iteration 277 took 24s (42.88% Gen, 53.24% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 5m 9s. Estimated total time: 20h 47m 23s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 34s, 500 more iterations: 3h 27m 53s. [2025-11-13 09:46:02,643][__main__][INFO] - Starting iteration 277. [2025-11-13 09:46:02,647][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:46:02,647][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:46:13,487][__main__][INFO] - Number of regex retries in iteration 277: 0 [2025-11-13 09:46:13,488][__main__][INFO] - agents played in iteration 277 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:46:13,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:14,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:14,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:14,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:14,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:46:14,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:46:14,863][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:46:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:46:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:46:15,814][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:46:16,140][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:46:16,470][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:46:16,795][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:46:17,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:46:17,445][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:46:17,780][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:46:18,106][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:46:18,430][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:46:18,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:46:19,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:46:19,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:46:19,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:46:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:46:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:46:20,720][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:46:21,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:46:21,368][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:46:21,692][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:46:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:46:22,341][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:46:22,664][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:46:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:46:23,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:46:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:46:23,962][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:46:24,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:46:24,610][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:46:24,936][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:46:25,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:46:26,049][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:46:26,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:46:26,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:46:26,805][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:46:27,780][__main__][INFO] - Iteration 278 took 25s (43.13% Gen, 52.99% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 14m 2s. Estimated total time: 20h 56m 41s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 53s, 500 more iterations: 3h 29m 26s. [2025-11-13 09:46:27,783][__main__][INFO] - Starting iteration 278. [2025-11-13 09:46:27,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:46:27,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:46:38,766][__main__][INFO] - Number of regex retries in iteration 278: 0 [2025-11-13 09:46:38,767][__main__][INFO] - agents played in iteration 278 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:46:39,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:39,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:39,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:39,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:39,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:46:39,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:46:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:46:40,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:46:40,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:46:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:46:41,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:46:41,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:46:42,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:46:42,357][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:46:42,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:46:43,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:46:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:46:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:46:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:46:44,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:46:44,661][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:46:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:46:45,319][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:46:45,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:46:45,969][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:46:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:46:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:46:46,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:46:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:46:47,602][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:46:47,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:46:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:46:48,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:46:48,902][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:46:49,227][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:46:49,557][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:46:49,882][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:46:50,209][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:46:50,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:46:51,283][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:46:52,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:46:52,045][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:46:52,047][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:46:52,955][__main__][INFO] - Iteration 279 took 25s (43.63% Gen, 52.76% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 15m 27s. Estimated total time: 20h 58m 31s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 57s, 500 more iterations: 3h 29m 45s. [2025-11-13 09:46:52,957][__main__][INFO] - Starting iteration 279. [2025-11-13 09:46:52,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:46:52,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:47:03,804][__main__][INFO] - Number of regex retries in iteration 279: 0 [2025-11-13 09:47:03,804][__main__][INFO] - agents played in iteration 279 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:47:04,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:04,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:04,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:04,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:04,377][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:47:04,377][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:47:05,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:47:05,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:47:05,757][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:47:06,083][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:47:06,414][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:47:06,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:47:07,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:47:07,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:47:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:47:08,043][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:47:08,367][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:47:08,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:47:09,020][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:47:09,347][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:47:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:47:09,995][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:47:10,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:47:10,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:47:10,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:47:11,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:47:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:47:11,939][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:47:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:47:12,586][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:47:12,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:47:13,234][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:47:13,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:47:13,883][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:47:14,208][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:47:14,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:47:14,857][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:47:15,183][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:47:15,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:47:16,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:47:17,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:47:17,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:47:17,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:47:17,956][__main__][INFO] - Iteration 280 took 24s (43.38% Gen, 52.86% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 6m 20s. Estimated total time: 20h 49m 49s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 39s, 500 more iterations: 3h 28m 18s. [2025-11-13 09:47:17,958][__main__][INFO] - Starting iteration 280. [2025-11-13 09:47:17,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. [2025-11-13 09:47:17,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:47:27,816][__main__][INFO] - Number of regex retries in iteration 280: 0 [2025-11-13 09:47:27,817][__main__][INFO] - agents played in iteration 280 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:47:28,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:28,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:28,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:28,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:28,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:47:28,396][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:47:29,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:47:29,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:47:29,779][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:47:30,109][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:47:30,440][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:47:30,765][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:47:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:47:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:47:31,743][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:47:32,067][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:47:32,390][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:47:32,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:47:33,039][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:47:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:47:33,686][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:47:34,012][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:47:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:47:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:47:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:47:35,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:47:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:47:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:47:36,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:47:36,610][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:47:36,936][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:47:37,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:47:37,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:47:37,907][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:47:38,235][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:47:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:47:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:47:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:47:39,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:47:40,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:47:41,034][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:47:41,036][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:47:41,037][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:47:42,815][__main__][INFO] - Iteration 281 took 24s (39.65% Gen, 53.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 52s. Estimated total time: 20h 42m 46s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 25s, 500 more iterations: 3h 27m 7s. [2025-11-13 09:47:42,818][__main__][INFO] - Starting iteration 281. [2025-11-13 09:47:42,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:47:42,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:47:53,163][__main__][INFO] - Number of regex retries in iteration 281: 0 [2025-11-13 09:47:53,163][__main__][INFO] - agents played in iteration 281 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:47:53,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:53,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:53,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:53,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:53,729][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:47:53,729][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:47:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:47:54,787][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:47:55,113][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:47:55,441][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:47:55,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:47:56,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:47:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:47:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:47:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:47:57,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:47:57,725][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:47:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:47:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:47:58,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:47:59,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:47:59,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:47:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:47:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:48:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:48:00,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:48:00,968][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:48:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:48:01,622][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:48:01,944][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:48:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:48:02,595][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:48:02,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:48:03,245][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:48:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:48:03,894][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:48:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:48:04,546][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:48:04,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:48:05,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:48:06,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:48:06,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:48:06,388][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:48:07,335][__main__][INFO] - Iteration 282 took 24s (42.19% Gen, 53.94% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 41m 27s. Estimated total time: 20h 25m 45s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 51s, 500 more iterations: 3h 24m 17s. [2025-11-13 09:48:07,337][__main__][INFO] - Starting iteration 282. [2025-11-13 09:48:07,340][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:48:07,340][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:48:18,793][__main__][INFO] - Number of regex retries in iteration 282: 0 [2025-11-13 09:48:18,794][__main__][INFO] - agents played in iteration 282 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:48:19,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:19,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:19,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:19,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:19,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:48:19,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:48:20,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:48:20,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:48:20,760][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:48:21,085][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:48:21,409][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:48:21,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:48:22,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:48:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:48:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:48:23,045][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:48:23,372][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:48:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:48:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:48:24,347][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:48:24,671][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:48:24,995][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:48:25,319][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:48:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:48:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:48:26,297][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:48:26,626][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:48:26,956][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:48:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:48:27,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:48:27,939][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:48:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:48:28,588][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:48:28,914][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:48:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:48:29,564][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:48:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:48:30,217][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:48:30,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:48:31,306][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:48:32,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:48:32,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:48:32,018][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:48:32,954][__main__][INFO] - Iteration 283 took 25s (44.72% Gen, 51.63% Train). Generation: 11s, Training: 13s. Estimated remaining time: 19h 36m 1s. Estimated total time: 21h 20m 45s. Time estimates for 10 more iterations: 4m 16s, 100 more iterations: 42m 41s, 500 more iterations: 3h 33m 27s. [2025-11-13 09:48:32,956][__main__][INFO] - Starting iteration 283. [2025-11-13 09:48:32,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:48:32,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:48:43,345][__main__][INFO] - Number of regex retries in iteration 283: 0 [2025-11-13 09:48:43,345][__main__][INFO] - agents played in iteration 283 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:48:43,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:43,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:43,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:43,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:43,926][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:48:43,926][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:48:44,683][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:48:44,978][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:48:45,308][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:48:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:48:45,960][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:48:46,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:48:46,608][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:48:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:48:47,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:48:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:48:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:48:48,249][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:48:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:48:48,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:48:49,225][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:48:49,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:48:49,875][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:48:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:48:50,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:48:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:48:51,176][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:48:51,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:48:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:48:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:48:52,479][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:48:52,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:48:53,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:48:53,453][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:48:53,778][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:48:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:48:54,427][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:48:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:48:55,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:48:55,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:48:56,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:48:56,571][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:48:56,573][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:48:57,528][__main__][INFO] - Iteration 284 took 24s (42.27% Gen, 53.84% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 43m 20s. Estimated total time: 20h 28m 29s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 56s, 500 more iterations: 3h 24m 44s. [2025-11-13 09:48:57,530][__main__][INFO] - Starting iteration 284. [2025-11-13 09:48:57,532][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:48:57,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:49:07,985][__main__][INFO] - Number of regex retries in iteration 284: 0 [2025-11-13 09:49:07,985][__main__][INFO] - agents played in iteration 284 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:49:08,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:08,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:08,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:08,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:08,544][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:49:08,544][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:49:09,304][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:49:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:49:09,925][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:49:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:49:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:49:10,905][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:49:11,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:49:11,559][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:49:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:49:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:49:12,528][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:49:12,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:49:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:49:13,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:49:13,827][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:49:14,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:49:14,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:49:14,801][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:49:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:49:15,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:49:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:49:16,101][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:49:16,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:49:16,749][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:49:17,079][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:49:17,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:49:17,731][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:49:18,063][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:49:18,381][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:49:18,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:49:19,034][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:49:19,368][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:49:19,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:49:20,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:49:21,163][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:49:21,164][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:49:21,166][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:49:22,084][__main__][INFO] - Iteration 285 took 24s (42.57% Gen, 53.69% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 42m 2s. Estimated total time: 20h 27m 36s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 55s, 500 more iterations: 3h 24m 36s. [2025-11-13 09:49:22,086][__main__][INFO] - Starting iteration 285. [2025-11-13 09:49:22,089][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:49:22,090][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:49:33,345][__main__][INFO] - Number of regex retries in iteration 285: 0 [2025-11-13 09:49:33,345][__main__][INFO] - agents played in iteration 285 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:49:33,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:33,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:33,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:33,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:33,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:49:33,917][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:49:34,704][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:49:34,999][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:49:35,325][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:49:35,657][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:49:35,981][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:49:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:49:36,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:49:36,953][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:49:37,277][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:49:37,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:49:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:49:38,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:49:38,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:49:38,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:49:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:49:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:49:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:49:40,208][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:49:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:49:40,862][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:49:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:49:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:49:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:49:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:49:42,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:49:42,812][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:49:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:49:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:49:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:49:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:49:44,447][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:49:44,773][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:49:45,103][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:49:45,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:49:46,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:49:46,831][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:49:46,833][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:49:47,823][__main__][INFO] - Iteration 286 took 25s (43.73% Gen, 52.41% Train). Generation: 11s, Training: 13s. Estimated remaining time: 19h 40m 46s. Estimated total time: 21h 26m 45s. Time estimates for 10 more iterations: 4m 17s, 100 more iterations: 42m 53s, 500 more iterations: 3h 34m 27s. [2025-11-13 09:49:47,828][__main__][INFO] - Starting iteration 286. [2025-11-13 09:49:47,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:49:47,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:49:58,703][__main__][INFO] - Number of regex retries in iteration 286: 0 [2025-11-13 09:49:58,703][__main__][INFO] - agents played in iteration 286 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:49:59,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:59,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:59,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:59,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:59,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:49:59,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:50:00,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:50:00,365][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:50:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:50:01,018][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:50:01,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:50:01,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:50:01,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:50:02,317][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:50:02,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:50:02,967][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:50:03,292][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:50:03,617][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:50:03,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:50:04,266][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:50:04,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:50:04,918][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:50:05,242][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:50:05,566][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:50:05,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:50:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:50:06,543][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:50:06,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:50:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:50:07,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:50:07,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:50:08,171][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:50:08,503][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:50:08,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:50:09,160][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:50:09,491][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:50:09,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:50:10,143][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:50:10,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:50:11,269][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:50:11,986][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:50:11,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:50:11,989][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:50:12,949][__main__][INFO] - Iteration 287 took 25s (43.28% Gen, 52.89% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 9m 33s. Estimated total time: 20h 55m 57s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 51s, 500 more iterations: 3h 29m 19s. [2025-11-13 09:50:12,951][__main__][INFO] - Starting iteration 287. [2025-11-13 09:50:12,954][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:50:12,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:50:23,748][__main__][INFO] - Number of regex retries in iteration 287: 0 [2025-11-13 09:50:23,749][__main__][INFO] - agents played in iteration 287 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:50:24,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:24,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:24,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:24,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:24,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:50:24,333][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:50:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:50:25,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:50:25,670][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:50:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:50:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:50:26,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:50:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:50:27,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:50:27,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:50:27,938][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:50:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:50:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:50:28,914][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:50:29,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:50:29,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:50:29,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:50:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:50:30,537][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:50:30,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:50:31,187][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:50:31,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:50:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:50:32,165][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:50:32,490][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:50:32,815][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:50:33,141][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:50:33,469][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:50:33,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:50:34,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:50:34,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:50:34,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:50:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:50:35,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:50:36,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:50:36,902][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:50:36,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:50:36,906][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:50:37,840][__main__][INFO] - Iteration 288 took 24s (43.37% Gen, 52.87% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 57m 33s. Estimated total time: 20h 44m 22s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 28s, 500 more iterations: 3h 27m 23s. [2025-11-13 09:50:37,842][__main__][INFO] - Starting iteration 288. [2025-11-13 09:50:37,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:50:37,845][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:50:48,524][__main__][INFO] - Number of regex retries in iteration 288: 0 [2025-11-13 09:50:48,525][__main__][INFO] - agents played in iteration 288 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:50:49,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:49,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:49,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:49,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:49,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:50:49,112][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:50:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:50:50,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:50:50,478][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:50:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:50:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:50:51,451][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:50:51,781][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:50:52,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:50:52,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:50:52,756][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:50:53,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:50:53,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:50:53,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:50:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:50:54,389][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:50:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:50:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:50:55,362][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:50:55,687][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:50:56,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:50:56,335][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:50:56,659][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:50:56,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:50:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:50:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:50:57,961][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:50:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:50:58,611][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:50:58,937][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:50:59,264][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:50:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:50:59,920][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:51:00,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:51:01,046][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:51:01,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:51:01,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:51:01,751][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:51:02,672][__main__][INFO] - Iteration 289 took 24s (43.01% Gen, 53.27% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 54m 11s. Estimated total time: 20h 41m 25s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 22s, 500 more iterations: 3h 26m 54s. [2025-11-13 09:51:02,675][__main__][INFO] - Starting iteration 289. [2025-11-13 09:51:02,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:51:02,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:51:13,457][__main__][INFO] - Number of regex retries in iteration 289: 0 [2025-11-13 09:51:13,458][__main__][INFO] - agents played in iteration 289 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:51:13,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:13,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:13,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:14,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:14,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:51:14,023][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:51:14,770][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:51:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:51:15,392][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:51:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:51:16,039][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:51:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:51:16,686][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:51:17,010][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:51:17,333][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:51:17,657][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:51:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:51:18,316][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:51:18,640][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:51:18,964][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:51:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:51:19,611][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:51:19,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:51:20,258][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:51:20,582][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:51:20,906][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:51:21,230][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:51:21,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:51:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:51:22,206][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:51:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:51:22,854][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:51:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:51:23,517][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:51:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:51:24,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:51:24,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:51:24,825][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:51:25,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:51:25,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:51:26,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:51:26,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:51:26,649][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:51:27,567][__main__][INFO] - Iteration 290 took 24s (43.31% Gen, 53.00% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 56m 52s. Estimated total time: 20h 44m 31s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 29s, 500 more iterations: 3h 27m 25s. [2025-11-13 09:51:27,569][__main__][INFO] - Starting iteration 290. [2025-11-13 09:51:27,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. [2025-11-13 09:51:27,572][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:51:38,112][__main__][INFO] - Number of regex retries in iteration 290: 0 [2025-11-13 09:51:38,112][__main__][INFO] - agents played in iteration 290 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:51:38,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:38,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:38,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:38,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:38,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:51:38,687][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:51:39,448][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:51:39,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:51:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:51:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:51:40,716][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:51:41,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:51:41,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:51:41,699][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:51:42,025][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:51:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:51:42,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:51:43,006][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:51:43,331][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:51:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:51:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:51:44,308][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:51:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:51:44,962][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:51:45,288][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:51:45,617][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:51:45,944][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:51:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:51:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:51:46,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:51:47,250][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:51:47,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:51:47,905][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:51:48,230][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:51:48,555][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:51:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:51:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:51:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:51:49,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:51:50,628][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:51:51,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:51:51,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:51:51,341][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:51:53,197][__main__][INFO] - Iteration 291 took 25s (41.13% Gen, 51.62% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 33m 15s. Estimated total time: 21h 21m 19s. Time estimates for 10 more iterations: 4m 16s, 100 more iterations: 42m 42s, 500 more iterations: 3h 33m 33s. [2025-11-13 09:51:53,200][__main__][INFO] - Starting iteration 291. [2025-11-13 09:51:53,205][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:51:53,205][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:52:03,889][__main__][INFO] - Number of regex retries in iteration 291: 0 [2025-11-13 09:52:03,889][__main__][INFO] - agents played in iteration 291 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:52:04,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:04,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:04,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:04,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:04,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:52:04,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:52:05,204][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:52:05,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:52:05,828][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:52:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:52:06,490][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:52:06,820][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:52:07,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:52:07,473][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:52:07,808][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:52:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:52:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:52:08,788][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:52:09,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:52:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:52:09,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:52:10,102][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:52:10,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:52:10,760][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:52:11,087][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:52:11,414][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:52:11,744][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:52:12,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:52:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:52:12,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:52:13,048][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:52:13,372][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:52:13,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:52:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:52:14,349][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:52:14,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:52:15,000][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:52:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:52:15,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:52:16,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:52:17,138][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:52:17,139][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:52:17,141][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:52:18,046][__main__][INFO] - Iteration 292 took 24s (43.00% Gen, 53.34% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 53m 38s. Estimated total time: 20h 42m 8s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 24s, 500 more iterations: 3h 27m 1s. [2025-11-13 09:52:18,048][__main__][INFO] - Starting iteration 292. [2025-11-13 09:52:18,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:52:18,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:52:28,618][__main__][INFO] - Number of regex retries in iteration 292: 0 [2025-11-13 09:52:28,619][__main__][INFO] - agents played in iteration 292 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:52:29,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:29,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:29,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:29,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:29,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:52:29,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:52:29,913][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:52:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:52:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:52:30,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:52:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:52:31,507][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:52:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:52:32,158][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:52:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:52:32,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:52:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:52:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:52:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:52:34,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:52:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:52:34,765][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:52:35,089][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:52:35,414][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:52:35,740][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:52:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:52:36,392][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:52:36,718][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:52:37,043][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:52:37,369][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:52:37,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:52:38,021][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:52:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:52:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:52:39,001][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:52:39,326][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:52:39,651][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:52:39,978][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:52:40,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:52:41,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:52:41,776][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:52:41,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:52:41,778][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:52:42,702][__main__][INFO] - Iteration 293 took 24s (42.87% Gen, 53.38% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 43m 44s. Estimated total time: 20h 32m 38s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 5s, 500 more iterations: 3h 25m 26s. [2025-11-13 09:52:42,704][__main__][INFO] - Starting iteration 293. [2025-11-13 09:52:42,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:52:42,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:52:52,194][__main__][INFO] - Number of regex retries in iteration 293: 0 [2025-11-13 09:52:52,195][__main__][INFO] - agents played in iteration 293 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:52:52,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:52,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:52,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:52,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:52,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:52:52,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:52:53,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:52:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:52:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:52:54,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:52:54,785][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:52:55,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:52:55,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:52:55,764][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:52:56,090][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:52:56,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:52:56,745][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:52:57,070][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:52:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:52:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:52:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:52:58,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:52:58,696][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:52:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:52:59,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:52:59,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:52:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:53:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:01,626][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:01,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:02,278][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:02,603][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:02,930][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:03,256][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:53:03,584][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:53:03,910][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:53:04,682][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:05,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:05,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:05,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:06,351][__main__][INFO] - Iteration 294 took 23s (40.12% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 58s. Estimated total time: 19h 42m 15s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s. [2025-11-13 09:53:06,353][__main__][INFO] - Starting iteration 294. [2025-11-13 09:53:06,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:53:06,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:16,378][__main__][INFO] - Number of regex retries in iteration 294: 0 [2025-11-13 09:53:16,379][__main__][INFO] - agents played in iteration 294 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:53:16,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:16,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:16,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:16,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:16,941][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:16,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:53:17,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:53:17,988][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:53:18,313][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:53:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:53:18,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:53:19,287][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:53:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:53:19,939][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:53:20,270][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:53:20,595][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:53:20,921][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:53:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:53:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:53:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:53:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:53:22,550][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:53:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:53:23,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:53:23,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:53:23,851][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:53:24,177][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:53:24,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:25,156][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:25,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:26,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:26,799][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:27,129][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:27,458][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:53:27,785][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:53:28,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:53:28,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:29,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:29,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:29,607][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:30,571][__main__][INFO] - Iteration 295 took 24s (41.39% Gen, 54.62% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 21m 8s. Estimated total time: 20h 10m 50s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 21s, 500 more iterations: 3h 21m 48s. [2025-11-13 09:53:30,573][__main__][INFO] - Starting iteration 295. [2025-11-13 09:53:30,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:53:30,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:40,250][__main__][INFO] - Number of regex retries in iteration 295: 0 [2025-11-13 09:53:40,251][__main__][INFO] - agents played in iteration 295 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:53:40,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:40,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:40,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:40,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:40,815][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:40,816][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:53:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:53:41,870][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:53:42,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:53:42,523][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:53:42,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:53:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:53:43,495][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:53:43,826][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:53:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:53:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:53:44,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:53:45,140][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:53:45,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:53:45,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:53:46,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:53:46,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:53:46,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:53:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:53:47,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:53:47,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:53:48,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:53:48,387][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:49,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:49,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:49,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:50,347][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:50,667][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:51,318][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:53:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:53:51,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:53:52,724][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:53,445][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:53,446][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:53,448][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:54,391][__main__][INFO] - Iteration 296 took 23s (40.62% Gen, 55.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 0m 42s. Estimated total time: 19h 50m 47s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 27s. [2025-11-13 09:53:54,393][__main__][INFO] - Starting iteration 296. [2025-11-13 09:53:54,396][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:53:54,397][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:54:04,152][__main__][INFO] - Number of regex retries in iteration 296: 0 [2025-11-13 09:54:04,153][__main__][INFO] - agents played in iteration 296 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:54:04,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:04,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:04,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:04,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:04,715][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:54:04,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:54:05,490][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:54:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:54:06,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:54:06,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:54:06,757][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:54:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:54:07,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:54:07,737][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:54:08,065][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:54:08,394][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:54:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:54:09,048][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:54:09,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:54:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:54:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:54:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:54:10,684][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:54:11,010][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:54:11,343][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:54:11,661][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:54:11,986][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:54:12,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:54:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:54:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:54:13,289][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:54:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:54:13,948][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:54:14,266][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:54:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:54:14,916][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:54:15,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:54:15,566][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:54:15,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:54:16,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:54:17,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:54:17,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:54:17,405][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:54:18,367][__main__][INFO] - Iteration 297 took 23s (40.69% Gen, 55.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 6s. Estimated total time: 19h 58m 36s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 46s. [2025-11-13 09:54:18,369][__main__][INFO] - Starting iteration 297. [2025-11-13 09:54:18,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:54:18,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:54:28,397][__main__][INFO] - Number of regex retries in iteration 297: 0 [2025-11-13 09:54:28,398][__main__][INFO] - agents played in iteration 297 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:54:28,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:28,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:28,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:28,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:28,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:54:28,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:54:29,706][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:54:30,009][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:54:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:54:30,659][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:54:30,984][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:54:31,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:54:31,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:54:31,961][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:54:32,286][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:54:32,611][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:54:32,936][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:54:33,262][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:54:33,587][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:54:33,917][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:54:34,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:54:34,569][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:54:34,895][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:54:35,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:54:35,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:54:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:54:36,198][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:54:36,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:54:36,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:54:37,176][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:54:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:54:37,837][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:54:38,162][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:54:38,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:54:38,812][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:54:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:54:39,469][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:54:39,795][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:54:40,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:54:40,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:54:41,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:54:41,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:54:41,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:54:42,542][__main__][INFO] - Iteration 298 took 24s (41.47% Gen, 54.71% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 17m 36s. Estimated total time: 20h 8m 30s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 17s, 500 more iterations: 3h 21m 25s. [2025-11-13 09:54:42,544][__main__][INFO] - Starting iteration 298. [2025-11-13 09:54:42,546][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:54:42,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:54:52,114][__main__][INFO] - Number of regex retries in iteration 298: 0 [2025-11-13 09:54:52,114][__main__][INFO] - agents played in iteration 298 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:54:52,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:52,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:52,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:52,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:52,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:54:52,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:54:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:54:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:54:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:54:54,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:54:54,700][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:54:55,038][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:54:55,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:54:55,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:54:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:54:56,350][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:54:56,680][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:54:57,006][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:54:57,330][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:54:57,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:54:57,984][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:54:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:54:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:54:58,963][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:54:59,289][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:54:59,614][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:54:59,940][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:55:00,267][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:55:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:55:00,922][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:55:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:55:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:55:01,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:55:02,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:55:02,548][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:55:02,874][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:55:03,199][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:55:03,525][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:55:03,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:55:04,626][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:55:05,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:55:05,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:55:05,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:55:06,261][__main__][INFO] - Iteration 299 took 23s (40.34% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 31s. Estimated total time: 19h 45m 48s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 38s. [2025-11-13 09:55:06,263][__main__][INFO] - Starting iteration 299. [2025-11-13 09:55:06,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:55:06,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:55:15,270][__main__][INFO] - Number of regex retries in iteration 299: 0 [2025-11-13 09:55:15,271][__main__][INFO] - agents played in iteration 299 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:55:15,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:15,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:15,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:15,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:15,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:55:15,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:55:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:55:16,891][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:55:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:55:17,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:55:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:55:18,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:55:18,524][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:55:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:55:19,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:55:19,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:55:19,837][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:55:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:55:20,489][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:55:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:55:21,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:55:21,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:55:21,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:55:22,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:55:22,446][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:55:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:55:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:55:23,426][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:55:23,761][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:55:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:55:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:55:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:55:25,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:55:25,389][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:55:25,721][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:55:26,047][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:55:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:55:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:55:27,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:55:27,801][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:55:28,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:55:28,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:55:28,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:55:29,511][__main__][INFO] - Iteration 300 took 23s (38.73% Gen, 57.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 36s. Estimated total time: 19h 22m 17s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 42s. [2025-11-13 09:55:29,513][__main__][INFO] - Starting iteration 300. [2025-11-13 09:55:29,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. [2025-11-13 09:55:29,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:55:38,911][__main__][INFO] - Number of regex retries in iteration 300: 0 [2025-11-13 09:55:38,911][__main__][INFO] - agents played in iteration 300 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:55:39,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:39,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:39,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:39,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:39,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:55:39,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:55:40,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:55:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:55:40,867][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:55:41,192][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:55:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:55:41,848][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:55:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:55:42,506][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:55:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:55:43,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:55:43,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:55:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:55:44,129][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:55:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:55:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:55:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:55:45,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:55:45,761][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:55:46,086][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:55:46,411][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:55:46,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:55:47,074][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:55:47,399][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:55:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:55:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:55:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:55:48,714][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:55:49,040][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:55:49,371][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:55:49,703][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:55:50,027][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:55:50,353][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:55:50,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:55:51,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:55:52,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:55:52,130][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:55:52,131][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:55:53,932][__main__][INFO] - Iteration 301 took 24s (38.48% Gen, 54.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 46s. Estimated total time: 20h 20m 51s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 41s, 500 more iterations: 3h 23m 28s. [2025-11-13 09:55:53,934][__main__][INFO] - Starting iteration 301. [2025-11-13 09:55:53,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:55:53,937][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:56:03,105][__main__][INFO] - Number of regex retries in iteration 301: 0 [2025-11-13 09:56:03,105][__main__][INFO] - agents played in iteration 301 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:56:03,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:03,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:03,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:04,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:04,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:56:04,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:56:04,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:56:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:56:05,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:56:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:56:06,048][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:56:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:56:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:56:07,025][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:56:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:56:07,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:56:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:56:08,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:56:08,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:56:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:56:09,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:56:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:56:09,955][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:56:10,280][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:56:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:56:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:56:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:56:11,584][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:56:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:56:12,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:56:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:56:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:56:13,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:56:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:56:13,882][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:56:14,200][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:56:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:56:14,850][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:56:15,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:56:15,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:56:16,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:56:16,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:56:16,629][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:56:17,547][__main__][INFO] - Iteration 302 took 23s (38.83% Gen, 57.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 5s. Estimated total time: 19h 40m 34s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 45s. [2025-11-13 09:56:17,549][__main__][INFO] - Starting iteration 302. [2025-11-13 09:56:17,552][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:56:17,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:56:26,913][__main__][INFO] - Number of regex retries in iteration 302: 0 [2025-11-13 09:56:26,913][__main__][INFO] - agents played in iteration 302 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:56:27,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:27,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:27,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:27,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:27,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:56:27,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:56:28,248][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:56:28,543][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:56:28,878][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:56:29,203][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:56:29,530][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:56:29,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:56:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:56:30,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:56:30,837][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:56:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:56:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:56:31,818][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:56:32,145][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:56:32,476][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:56:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:56:33,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:56:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:56:33,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:56:34,098][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:56:34,424][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:56:34,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:56:35,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:56:35,406][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:56:35,736][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:56:36,063][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:56:36,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:56:36,718][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:56:37,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:56:37,367][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:56:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:56:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:56:38,344][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:56:38,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:56:39,447][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:56:40,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:56:40,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:56:40,146][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:56:41,065][__main__][INFO] - Iteration 303 took 23s (39.81% Gen, 56.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 50s. Estimated total time: 19h 35m 42s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 57s. [2025-11-13 09:56:41,067][__main__][INFO] - Starting iteration 303. [2025-11-13 09:56:41,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:56:41,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:56:50,815][__main__][INFO] - Number of regex retries in iteration 303: 0 [2025-11-13 09:56:50,815][__main__][INFO] - agents played in iteration 303 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:56:51,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:51,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:51,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:51,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:51,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:56:51,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:56:52,170][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:56:52,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:56:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:56:53,116][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:56:53,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:56:53,766][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:56:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:56:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:56:54,744][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:56:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:56:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:56:55,725][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:56:56,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:56:56,376][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:56:56,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:56:57,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:56:57,351][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:56:57,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:56:58,003][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:56:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:56:58,653][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:56:58,979][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:56:59,306][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:56:59,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:56:59,957][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:57:00,282][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:57:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:57:00,935][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:57:01,260][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:57:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:57:01,912][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:57:02,238][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:57:02,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:57:03,370][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:57:04,071][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:57:04,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:57:04,074][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:57:05,062][__main__][INFO] - Iteration 304 took 23s (40.61% Gen, 55.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 6m 19s. Estimated total time: 19h 59m 35s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 55s. [2025-11-13 09:57:05,064][__main__][INFO] - Starting iteration 304. [2025-11-13 09:57:05,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:57:05,069][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:57:14,315][__main__][INFO] - Number of regex retries in iteration 304: 0 [2025-11-13 09:57:14,315][__main__][INFO] - agents played in iteration 304 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:57:14,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:14,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:14,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:14,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:14,897][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:57:14,898][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:57:15,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:57:15,978][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:57:16,310][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:57:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:57:16,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:57:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:57:17,609][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:57:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:57:18,255][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:57:18,580][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:57:18,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:57:19,231][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:57:19,556][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:57:19,882][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:57:20,209][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:57:20,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:57:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:57:21,181][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:57:21,512][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:57:21,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:57:22,159][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:57:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:57:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:57:23,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:57:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:57:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:57:24,124][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:57:24,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:57:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:57:25,102][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:57:25,428][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:57:25,755][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:57:26,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:57:26,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:57:27,562][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:57:27,564][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:57:27,566][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:57:28,547][__main__][INFO] - Iteration 305 took 23s (39.38% Gen, 56.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 19s. Estimated total time: 19h 33m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 39s. [2025-11-13 09:57:28,549][__main__][INFO] - Starting iteration 305. [2025-11-13 09:57:28,552][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:57:28,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:57:37,926][__main__][INFO] - Number of regex retries in iteration 305: 0 [2025-11-13 09:57:37,927][__main__][INFO] - agents played in iteration 305 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:57:38,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:38,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:38,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:38,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:38,536][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:57:38,536][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:57:39,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:57:39,618][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:57:39,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:57:40,277][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:57:40,602][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:57:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:57:41,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:57:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:57:41,909][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:57:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:57:42,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:57:42,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:57:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:57:43,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:57:43,874][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:57:44,199][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:57:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:57:44,851][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:57:45,188][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:57:45,515][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:57:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:57:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:57:46,492][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:57:46,817][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:57:47,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:57:47,468][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:57:47,794][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:57:48,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:57:48,444][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:57:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:57:49,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:57:49,425][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:57:49,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:57:50,525][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:57:51,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:57:51,252][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:57:51,254][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:57:52,281][__main__][INFO] - Iteration 306 took 23s (39.50% Gen, 56.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 25s. Estimated total time: 19h 46m 29s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 44s. [2025-11-13 09:57:52,283][__main__][INFO] - Starting iteration 306. [2025-11-13 09:57:52,287][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:57:52,288][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:58:01,946][__main__][INFO] - Number of regex retries in iteration 306: 0 [2025-11-13 09:58:01,946][__main__][INFO] - agents played in iteration 306 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:58:02,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:02,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:02,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:02,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:02,557][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:58:02,557][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:58:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:58:03,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:58:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:58:04,257][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:58:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:58:04,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:58:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:58:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:58:05,888][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:58:06,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:58:06,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:58:06,866][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:58:07,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:58:07,517][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:58:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:58:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:58:08,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:58:08,825][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:58:09,152][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:58:09,478][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:58:09,803][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:58:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:58:10,455][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:58:10,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:58:11,106][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:58:11,432][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:58:11,757][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:58:12,082][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:58:12,409][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:58:12,733][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:58:13,059][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:58:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:58:13,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:58:14,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:58:15,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:58:15,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:58:15,217][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:58:16,207][__main__][INFO] - Iteration 307 took 23s (40.37% Gen, 55.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 1m 35s. Estimated total time: 19h 56m 2s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 20s. [2025-11-13 09:58:16,209][__main__][INFO] - Starting iteration 307. [2025-11-13 09:58:16,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:58:16,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:58:25,713][__main__][INFO] - Number of regex retries in iteration 307: 0 [2025-11-13 09:58:25,714][__main__][INFO] - agents played in iteration 307 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:58:26,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:26,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:26,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:26,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:26,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:58:26,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:58:27,059][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:58:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:58:27,680][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:58:28,009][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:58:28,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:58:28,653][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:58:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:58:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:58:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:58:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:58:30,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:58:30,614][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:58:30,939][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:58:31,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:58:31,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:58:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:58:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:58:32,573][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:58:32,898][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:58:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:58:33,557][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:58:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:58:34,209][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:58:34,535][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:58:34,866][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:58:35,190][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:58:35,516][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:58:35,841][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:58:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:58:36,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:58:36,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:58:37,144][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:58:37,480][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:58:38,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:58:38,954][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:58:38,955][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:58:38,957][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:58:39,959][__main__][INFO] - Iteration 308 took 23s (40.01% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 31s. Estimated total time: 19h 47m 23s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 53s. [2025-11-13 09:58:39,961][__main__][INFO] - Starting iteration 308. [2025-11-13 09:58:39,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:58:39,965][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:58:49,321][__main__][INFO] - Number of regex retries in iteration 308: 0 [2025-11-13 09:58:49,322][__main__][INFO] - agents played in iteration 308 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:58:49,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:49,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:49,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:49,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:49,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:58:49,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:58:50,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:58:50,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:58:51,302][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:58:51,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:58:51,951][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:58:52,275][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:58:52,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:58:52,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:58:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:58:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:58:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:58:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:58:54,553][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:58:54,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:58:55,206][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:58:55,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:58:55,855][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:58:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:58:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:58:56,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:58:57,157][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:58:57,483][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:58:57,807][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:58:58,133][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:58:58,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:58:58,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:58:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:58:59,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:58:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:59:00,085][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:59:00,410][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:59:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:59:01,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:59:01,788][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:59:02,515][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:59:02,516][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:59:02,518][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:59:03,577][__main__][INFO] - Iteration 309 took 23s (39.62% Gen, 55.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 25s. Estimated total time: 19h 40m 40s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 46s. [2025-11-13 09:59:03,580][__main__][INFO] - Starting iteration 309. [2025-11-13 09:59:03,583][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:59:03,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:59:13,259][__main__][INFO] - Number of regex retries in iteration 309: 0 [2025-11-13 09:59:13,260][__main__][INFO] - agents played in iteration 309 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:59:13,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:13,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:13,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:13,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:13,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:59:13,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:59:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:59:14,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:59:15,223][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:59:15,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:59:15,873][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:59:16,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:59:16,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:59:16,848][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:59:17,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:59:17,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:59:17,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:59:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:59:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:59:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:59:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:59:19,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:59:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:59:20,110][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:59:20,435][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:59:20,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:59:21,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:59:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:59:21,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:59:22,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:59:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:59:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:59:23,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:59:23,370][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:59:23,694][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:59:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:59:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:59:24,676][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:59:25,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:59:25,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:59:26,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:59:26,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:59:26,493][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:59:27,478][__main__][INFO] - Iteration 310 took 23s (40.50% Gen, 55.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 9s. Estimated total time: 19h 54m 48s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 8s. [2025-11-13 09:59:27,482][__main__][INFO] - Starting iteration 310. [2025-11-13 09:59:27,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 09:59:27,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:59:36,472][__main__][INFO] - Number of regex retries in iteration 310: 0 [2025-11-13 09:59:36,473][__main__][INFO] - agents played in iteration 310 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 09:59:36,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:36,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:37,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:37,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:37,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:59:37,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:59:37,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:59:38,125][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:59:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:59:38,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:59:39,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:59:39,424][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:59:39,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:59:40,075][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:59:40,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:59:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:59:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:59:41,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:59:41,707][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:59:42,032][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:59:42,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:59:42,684][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:59:43,015][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:59:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:59:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:59:43,993][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:59:44,318][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:59:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:59:44,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:59:45,296][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:59:45,621][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:59:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:59:46,274][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:59:46,599][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:59:46,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:59:47,249][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:59:47,574][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:59:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:59:48,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 09:59:48,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:59:49,722][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:59:49,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:59:49,725][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:59:51,805][__main__][INFO] - Iteration 311 took 24s (36.95% Gen, 54.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 58s. Estimated total time: 20h 16m 1s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 32s, 500 more iterations: 3h 22m 40s. [2025-11-13 09:59:51,808][__main__][INFO] - Starting iteration 311. [2025-11-13 09:59:51,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 09:59:51,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:00:01,553][__main__][INFO] - Number of regex retries in iteration 311: 0 [2025-11-13 10:00:01,554][__main__][INFO] - agents played in iteration 311 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:00:02,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:02,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:02,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:02,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:02,135][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:00:02,136][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:00:02,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:00:03,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:00:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:00:03,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:00:04,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:00:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:00:04,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:00:05,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:00:05,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:00:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:00:06,201][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:00:06,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:00:06,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:00:07,181][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:00:07,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:00:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:00:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:00:08,483][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:00:08,808][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:00:09,132][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:00:09,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:00:09,792][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:00:10,117][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:00:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:00:10,767][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:00:11,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:00:11,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:00:11,751][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:00:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:00:12,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:00:12,727][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:00:13,054][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:00:13,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:00:14,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:00:14,854][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:00:14,855][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:00:14,857][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:00:15,821][__main__][INFO] - Iteration 312 took 24s (40.58% Gen, 55.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 6s. Estimated total time: 20h 0m 33s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 5s. [2025-11-13 10:00:15,824][__main__][INFO] - Starting iteration 312. [2025-11-13 10:00:15,827][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:00:15,828][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:00:24,646][__main__][INFO] - Number of regex retries in iteration 312: 0 [2025-11-13 10:00:24,646][__main__][INFO] - agents played in iteration 312 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:00:25,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:25,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:25,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:25,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:25,218][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:00:25,219][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:00:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:00:26,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:00:26,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:00:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:00:27,233][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:00:27,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:00:27,885][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:00:28,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:00:28,538][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:00:28,864][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:00:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:00:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:00:29,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:00:30,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:00:30,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:00:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:00:31,139][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:00:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:00:31,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:00:32,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:00:32,443][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:00:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:00:33,093][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:00:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:00:33,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:00:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:00:34,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:00:34,724][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:00:35,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:00:35,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:00:35,701][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:00:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:00:36,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:00:37,077][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:00:37,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:00:37,912][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:00:37,914][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:00:38,890][__main__][INFO] - Iteration 313 took 23s (38.23% Gen, 57.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 21s. Estimated total time: 19h 13m 11s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 11s. [2025-11-13 10:00:38,892][__main__][INFO] - Starting iteration 313. [2025-11-13 10:00:38,895][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:00:38,896][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:00:48,745][__main__][INFO] - Number of regex retries in iteration 313: 0 [2025-11-13 10:00:48,745][__main__][INFO] - agents played in iteration 313 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:00:49,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:49,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:49,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:49,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:49,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:00:49,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:00:50,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:00:50,387][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:00:50,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:00:51,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:00:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:00:51,692][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:00:52,019][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:00:52,345][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:00:52,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:00:52,996][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:00:53,321][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:00:53,645][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:00:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:00:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:00:54,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:00:54,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:00:55,276][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:00:55,603][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:00:55,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:00:56,252][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:00:56,577][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:00:56,902][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:00:57,228][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:00:57,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:00:57,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:00:58,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:00:58,530][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:00:58,861][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:00:59,188][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:00:59,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:00:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:01:00,173][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:01:00,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:01:01,272][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:01:02,003][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:01:02,009][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:01:02,011][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:01:02,965][__main__][INFO] - Iteration 314 took 24s (40.92% Gen, 55.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 6m 19s. Estimated total time: 20h 3m 33s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 7s, 500 more iterations: 3h 20m 35s. [2025-11-13 10:01:02,967][__main__][INFO] - Starting iteration 314. [2025-11-13 10:01:02,970][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:01:02,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:01:12,489][__main__][INFO] - Number of regex retries in iteration 314: 0 [2025-11-13 10:01:12,490][__main__][INFO] - agents played in iteration 314 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:01:13,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:13,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:13,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:13,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:13,102][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:01:13,103][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:01:13,873][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:01:14,173][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:01:14,498][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:01:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:01:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:01:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:01:15,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:01:16,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:01:16,446][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:01:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:01:17,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:01:17,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:01:17,746][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:01:18,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:01:18,395][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:01:18,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:01:19,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:01:19,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:01:19,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:01:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:01:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:01:20,673][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:01:20,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:01:21,325][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:01:21,651][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:01:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:01:22,305][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:01:22,629][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:01:22,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:01:23,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:01:23,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:01:23,927][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:01:24,252][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:01:24,998][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:01:25,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:01:25,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:01:25,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:01:26,636][__main__][INFO] - Iteration 315 took 23s (40.23% Gen, 55.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 41s. Estimated total time: 19h 43m 18s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 13s. [2025-11-13 10:01:26,639][__main__][INFO] - Starting iteration 315. [2025-11-13 10:01:26,642][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:01:26,642][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:01:35,869][__main__][INFO] - Number of regex retries in iteration 315: 0 [2025-11-13 10:01:35,870][__main__][INFO] - agents played in iteration 315 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:01:36,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:36,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:36,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:36,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:36,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:01:36,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:01:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:01:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:01:37,857][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:01:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:01:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:01:38,833][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:01:39,159][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:01:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:01:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:01:40,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:01:40,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:01:40,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:01:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:01:41,436][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:01:41,763][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:01:42,087][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:01:42,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:01:42,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:01:43,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:01:43,387][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:01:43,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:01:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:01:44,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:01:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:01:45,019][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:01:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:01:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:01:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:01:46,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:01:46,661][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:01:46,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:01:47,314][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:01:47,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:01:48,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:01:49,132][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:01:49,134][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:01:49,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:01:50,106][__main__][INFO] - Iteration 316 took 23s (39.32% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 35m 12s. Estimated total time: 19h 33m 14s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 32s. [2025-11-13 10:01:50,108][__main__][INFO] - Starting iteration 316. [2025-11-13 10:01:50,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:01:50,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:01:59,547][__main__][INFO] - Number of regex retries in iteration 316: 0 [2025-11-13 10:01:59,548][__main__][INFO] - agents played in iteration 316 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:02:00,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:00,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:00,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:00,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:00,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:02:00,137][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:02:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:02:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:02:01,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:02:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:02:02,203][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:02:02,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:02:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:02:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:02:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:02:03,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:02:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:02:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:02:04,802][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:02:05,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:02:05,454][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:02:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:02:06,108][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:02:06,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:02:06,758][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:02:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:02:07,408][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:02:07,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:02:08,060][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:02:08,385][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:02:08,711][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:02:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:02:09,364][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:02:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:02:10,021][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:02:10,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:02:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:02:11,000][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:02:11,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:02:12,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:02:12,852][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:02:12,854][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:02:12,855][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:02:13,864][__main__][INFO] - Iteration 317 took 23s (39.72% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 17s. Estimated total time: 19h 47m 42s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 57s. [2025-11-13 10:02:13,867][__main__][INFO] - Starting iteration 317. [2025-11-13 10:02:13,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:02:13,871][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:02:23,195][__main__][INFO] - Number of regex retries in iteration 317: 0 [2025-11-13 10:02:23,195][__main__][INFO] - agents played in iteration 317 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:02:23,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:24,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:24,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:24,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:24,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:02:24,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:02:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:02:25,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:02:25,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:02:25,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:02:26,213][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:02:26,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:02:26,864][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:02:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:02:27,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:02:27,840][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:02:28,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:02:28,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:02:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:02:29,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:02:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:02:29,792][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:02:30,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:02:30,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:02:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:02:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:02:31,418][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:02:31,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:02:32,066][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:02:32,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:02:32,719][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:02:33,046][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:02:33,371][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:02:33,695][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:02:34,022][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:02:34,348][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:02:34,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:02:34,998][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:02:35,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:02:36,076][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:02:36,815][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:02:36,816][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:02:36,818][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:02:37,880][__main__][INFO] - Iteration 318 took 24s (38.83% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 1m 41s. Estimated total time: 20h 0m 31s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 5s. [2025-11-13 10:02:37,882][__main__][INFO] - Starting iteration 318. [2025-11-13 10:02:37,886][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:02:37,886][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:02:47,580][__main__][INFO] - Number of regex retries in iteration 318: 0 [2025-11-13 10:02:47,580][__main__][INFO] - agents played in iteration 318 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:02:48,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:48,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:48,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:48,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:48,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:02:48,170][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:02:48,958][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:02:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:02:49,579][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:02:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:02:50,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:02:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:02:50,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:02:51,207][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:02:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:02:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:02:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:02:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:02:52,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:02:53,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:02:53,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:02:53,810][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:02:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:02:54,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:02:54,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:02:55,119][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:02:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:02:55,770][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:02:56,097][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:02:56,423][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:02:56,748][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:02:57,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:02:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:02:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:02:58,054][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:02:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:02:58,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:02:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:02:59,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:03:00,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:03:00,873][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:03:00,875][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:03:00,877][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:03:01,917][__main__][INFO] - Iteration 319 took 24s (40.34% Gen, 55.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 24s. Estimated total time: 20h 1m 37s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 3s, 500 more iterations: 3h 20m 16s. [2025-11-13 10:03:01,920][__main__][INFO] - Starting iteration 319. [2025-11-13 10:03:01,923][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:03:01,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:03:11,593][__main__][INFO] - Number of regex retries in iteration 319: 0 [2025-11-13 10:03:11,594][__main__][INFO] - agents played in iteration 319 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:03:12,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:12,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:12,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:12,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:12,177][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:03:12,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:03:12,955][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:03:13,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:03:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:03:13,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:03:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:03:14,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:03:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:03:15,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:03:15,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:03:15,852][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:03:16,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:03:16,502][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:03:16,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:03:17,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:03:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:03:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:03:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:03:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:03:18,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:03:19,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:03:19,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:03:19,766][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:03:20,092][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:03:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:03:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:03:21,078][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:03:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:03:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:03:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:03:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:03:22,714][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:03:23,044][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:03:23,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:03:24,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:03:24,844][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:03:24,846][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:03:24,847][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:03:25,827][__main__][INFO] - Iteration 320 took 23s (40.45% Gen, 55.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 37s. Estimated total time: 19h 55m 14s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 12s. [2025-11-13 10:03:25,831][__main__][INFO] - Starting iteration 320. [2025-11-13 10:03:25,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. [2025-11-13 10:03:25,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:03:35,475][__main__][INFO] - Number of regex retries in iteration 320: 0 [2025-11-13 10:03:35,476][__main__][INFO] - agents played in iteration 320 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:03:35,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:36,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:36,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:36,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:36,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:03:36,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:03:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:03:37,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:03:37,478][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:03:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:03:38,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:03:38,455][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:03:38,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:03:39,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:03:39,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:03:39,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:03:40,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:03:40,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:03:40,730][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:03:41,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:03:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:03:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:03:42,031][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:03:42,355][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:03:42,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:03:43,003][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:03:43,328][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:03:43,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:03:43,980][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:03:44,305][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:03:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:03:44,957][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:03:45,282][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:03:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:03:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:03:46,255][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:03:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:03:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:03:47,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:03:47,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:03:48,691][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:03:48,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:03:48,694][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:03:50,667][__main__][INFO] - Iteration 321 took 24s (38.69% Gen, 53.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 41m 41s. Estimated total time: 20h 41m 43s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 23s, 500 more iterations: 3h 26m 57s. [2025-11-13 10:03:50,670][__main__][INFO] - Starting iteration 321. [2025-11-13 10:03:50,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:03:50,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:03:59,810][__main__][INFO] - Number of regex retries in iteration 321: 0 [2025-11-13 10:03:59,810][__main__][INFO] - agents played in iteration 321 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:04:00,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:00,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:00,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:00,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:00,388][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:04:00,388][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:04:01,167][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:04:01,465][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:04:01,792][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:04:02,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:04:02,451][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:04:02,770][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:04:03,096][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:04:03,420][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:04:03,749][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:04:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:04:04,395][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:04:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:04:05,050][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:04:05,371][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:04:05,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:04:06,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:04:06,350][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:04:06,676][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:04:07,000][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:04:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:04:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:04:07,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:04:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:04:08,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:04:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:04:09,290][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:04:09,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:04:09,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:04:10,273][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:04:10,598][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:04:10,928][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:04:11,255][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:04:11,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:04:12,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:04:13,051][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:04:13,052][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:04:13,054][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:04:14,044][__main__][INFO] - Iteration 322 took 23s (39.10% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 12s. Estimated total time: 19h 28m 37s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 46s. [2025-11-13 10:04:14,046][__main__][INFO] - Starting iteration 322. [2025-11-13 10:04:14,049][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:04:14,050][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:04:23,615][__main__][INFO] - Number of regex retries in iteration 322: 0 [2025-11-13 10:04:23,615][__main__][INFO] - agents played in iteration 322 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:04:24,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:24,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:24,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:24,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:24,199][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:04:24,199][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:04:24,973][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:04:25,269][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:04:25,595][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:04:25,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:04:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:04:26,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:04:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:04:27,220][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:04:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:04:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:04:28,213][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:04:28,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:04:28,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:04:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:04:29,515][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:04:29,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:04:30,175][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:04:30,506][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:04:30,833][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:04:31,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:04:31,488][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:04:31,811][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:04:32,137][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:04:32,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:04:32,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:04:33,111][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:04:33,436][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:04:33,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:04:34,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:04:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:04:34,731][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:04:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:04:35,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:04:36,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:04:36,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:04:36,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:04:36,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:04:37,840][__main__][INFO] - Iteration 323 took 23s (40.21% Gen, 55.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 45s. Estimated total time: 19h 49m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 15s. [2025-11-13 10:04:37,842][__main__][INFO] - Starting iteration 323. [2025-11-13 10:04:37,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:04:37,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:04:47,278][__main__][INFO] - Number of regex retries in iteration 323: 0 [2025-11-13 10:04:47,279][__main__][INFO] - agents played in iteration 323 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:04:47,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:47,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:47,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:47,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:47,875][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:04:47,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:04:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:04:48,982][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:04:49,308][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:04:49,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:04:49,959][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:04:50,283][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:04:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:04:50,934][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:04:51,259][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:04:51,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:04:51,909][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:04:52,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:04:52,565][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:04:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:04:53,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:04:53,542][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:04:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:04:54,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:04:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:04:54,848][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:04:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:04:55,498][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:04:55,822][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:04:56,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:04:56,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:04:56,796][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:04:57,121][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:04:57,445][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:04:57,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:04:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:04:58,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:04:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:04:59,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:04:59,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:05:00,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:00,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:00,553][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:01,547][__main__][INFO] - Iteration 324 took 23s (39.79% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 55s. Estimated total time: 19h 45m 8s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 31s. [2025-11-13 10:05:01,550][__main__][INFO] - Starting iteration 324. [2025-11-13 10:05:01,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:05:01,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:05:11,120][__main__][INFO] - Number of regex retries in iteration 324: 0 [2025-11-13 10:05:11,121][__main__][INFO] - agents played in iteration 324 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:05:11,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:11,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:11,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:11,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:11,697][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:05:11,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:05:12,483][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:05:12,780][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:05:13,106][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:05:13,430][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:05:13,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:05:14,092][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:05:14,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:05:14,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:05:15,073][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:05:15,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:05:15,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:05:16,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:05:16,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:05:16,706][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:05:17,033][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:05:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:05:17,683][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:05:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:05:18,334][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:05:18,665][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:05:18,982][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:05:19,311][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:05:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:05:19,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:05:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:05:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:05:20,938][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:05:21,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:05:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:05:21,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:05:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:05:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:05:22,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:05:23,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:05:24,342][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:24,345][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:24,348][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:25,376][__main__][INFO] - Iteration 325 took 23s (40.16% Gen, 55.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 35s. Estimated total time: 19h 51m 11s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 31s. [2025-11-13 10:05:25,378][__main__][INFO] - Starting iteration 325. [2025-11-13 10:05:25,383][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:05:25,383][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:05:31,148][mllm.models.large_language_model_local][WARNING] - Response |), retry 1/1 [2025-11-13 10:05:34,706][__main__][INFO] - Number of regex retries in iteration 325: 1 [2025-11-13 10:05:34,706][__main__][INFO] - agents played in iteration 325 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:05:35,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:35,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:35,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:35,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:35,657][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:05:35,657][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:05:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:05:36,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:05:37,144][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:05:37,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:05:37,798][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:05:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:05:38,447][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:05:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:05:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:05:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:05:39,751][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:05:40,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:05:40,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:05:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:05:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:05:41,387][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:05:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:05:42,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:05:42,364][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:05:42,688][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:05:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:05:43,335][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:05:43,659][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:05:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:05:44,307][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:05:44,631][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:05:44,955][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:05:45,279][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:05:45,603][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:05:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:05:46,250][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:05:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:05:46,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:05:47,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:05:48,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:48,371][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:48,373][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:49,387][__main__][INFO] - Iteration 326 took 24s (38.83% Gen, 56.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 58m 18s. Estimated total time: 20h 0m 19s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 0s, 500 more iterations: 3h 20m 3s. [2025-11-13 10:05:49,390][__main__][INFO] - Starting iteration 326. [2025-11-13 10:05:49,393][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:05:49,394][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:05:58,881][__main__][INFO] - Number of regex retries in iteration 326: 0 [2025-11-13 10:05:58,882][__main__][INFO] - agents played in iteration 326 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:05:59,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:59,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:59,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:59,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:59,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:05:59,466][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:06:00,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:06:00,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:06:00,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:06:01,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:06:01,511][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:06:01,837][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:06:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:06:02,491][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:06:02,820][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:06:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:06:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:06:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:06:04,125][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:06:04,450][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:06:04,775][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:06:05,102][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:06:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:06:05,753][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:06:06,084][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:06:06,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:06:06,728][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:06:07,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:06:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:06:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:06:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:06:08,349][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:06:08,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:06:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:06:09,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:06:09,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:06:09,973][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:06:10,300][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:06:10,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:06:11,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:06:12,101][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:06:12,103][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:06:12,105][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:06:13,088][__main__][INFO] - Iteration 327 took 23s (40.04% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 24s. Estimated total time: 19h 44m 48s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 28s. [2025-11-13 10:06:13,091][__main__][INFO] - Starting iteration 327. [2025-11-13 10:06:13,094][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:06:13,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:06:22,776][__main__][INFO] - Number of regex retries in iteration 327: 0 [2025-11-13 10:06:22,777][__main__][INFO] - agents played in iteration 327 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:06:23,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:23,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:23,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:23,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:23,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:06:23,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:06:24,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:06:24,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:06:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:06:25,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:06:25,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:06:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:06:26,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:06:26,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:06:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:06:27,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:06:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:06:27,674][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:06:28,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:06:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:06:28,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:06:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:06:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:06:29,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:06:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:06:30,288][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:06:30,613][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:06:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:06:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:06:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:06:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:06:32,239][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:06:32,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:06:32,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:06:33,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:06:33,541][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:06:33,865][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:06:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:06:34,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:06:35,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:06:35,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:06:35,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:06:35,975][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:06:36,959][__main__][INFO] - Iteration 328 took 23s (40.57% Gen, 55.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 29s. Estimated total time: 19h 53m 17s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 52s. [2025-11-13 10:06:36,961][__main__][INFO] - Starting iteration 328. [2025-11-13 10:06:36,964][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:06:36,964][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:06:46,984][__main__][INFO] - Number of regex retries in iteration 328: 0 [2025-11-13 10:06:46,984][__main__][INFO] - agents played in iteration 328 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:06:47,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:47,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:47,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:47,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:47,562][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:06:47,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:06:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:06:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:06:48,947][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:06:49,274][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:06:49,605][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:06:49,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:06:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:06:50,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:06:50,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:06:51,244][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:06:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:06:51,907][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:06:52,232][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:06:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:06:52,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:06:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:06:53,535][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:06:53,858][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:06:54,182][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:06:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:06:54,832][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:06:55,155][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:06:55,480][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:06:55,804][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:06:56,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:06:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:06:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:06:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:06:57,428][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:06:57,752][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:06:58,076][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:06:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:06:58,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:06:59,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:07:00,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:07:00,187][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:07:00,189][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:07:01,156][__main__][INFO] - Iteration 329 took 24s (41.41% Gen, 54.58% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 6m 27s. Estimated total time: 20h 9m 39s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 19s, 500 more iterations: 3h 21m 36s. [2025-11-13 10:07:01,158][__main__][INFO] - Starting iteration 329. [2025-11-13 10:07:01,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:07:01,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:07:10,610][__main__][INFO] - Number of regex retries in iteration 329: 0 [2025-11-13 10:07:10,611][__main__][INFO] - agents played in iteration 329 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:07:11,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:11,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:11,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:11,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:11,177][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:07:11,177][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:07:11,937][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:07:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:07:12,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:07:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:07:13,215][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:07:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:07:13,864][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:07:14,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:07:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:07:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:07:15,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:07:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:07:15,815][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:07:16,143][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:07:16,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:07:16,797][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:07:17,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:07:17,452][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:07:17,777][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:07:18,104][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:07:18,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:07:18,754][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:07:19,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:07:19,410][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:07:19,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:07:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:07:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:07:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:07:21,036][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:07:21,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:07:21,685][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:07:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:07:22,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:07:23,060][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:07:23,811][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:07:23,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:07:23,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:07:24,824][__main__][INFO] - Iteration 330 took 23s (39.93% Gen, 55.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 39m 34s. Estimated total time: 19h 43m 10s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 11s. [2025-11-13 10:07:24,826][__main__][INFO] - Starting iteration 330. [2025-11-13 10:07:24,830][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. [2025-11-13 10:07:24,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:07:34,075][__main__][INFO] - Number of regex retries in iteration 330: 0 [2025-11-13 10:07:34,076][__main__][INFO] - agents played in iteration 330 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:07:34,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:34,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:34,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:34,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:34,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:07:34,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:07:35,821][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:07:36,118][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:07:36,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:07:36,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:07:37,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:07:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:07:37,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:07:38,064][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:07:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:07:38,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:07:39,037][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:07:39,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:07:39,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:07:40,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:07:40,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:07:40,666][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:07:40,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:07:41,315][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:07:41,639][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:07:41,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:07:42,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:07:42,610][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:07:42,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:07:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:07:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:07:43,908][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:07:44,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:07:44,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:07:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:07:45,210][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:07:45,539][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:07:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:07:46,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:07:46,919][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:07:47,656][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:07:47,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:07:47,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:07:49,686][__main__][INFO] - Iteration 331 took 24s (37.19% Gen, 54.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 50s. Estimated total time: 20h 42m 51s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 25s, 500 more iterations: 3h 27m 8s. [2025-11-13 10:07:49,689][__main__][INFO] - Starting iteration 331. [2025-11-13 10:07:49,692][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:07:49,692][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:07:59,223][__main__][INFO] - Number of regex retries in iteration 331: 0 [2025-11-13 10:07:59,223][__main__][INFO] - agents played in iteration 331 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:07:59,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:59,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:59,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:00,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:00,153][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:08:00,153][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:08:00,929][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:08:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:08:01,555][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:08:01,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:08:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:08:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:08:02,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:08:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:08:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:08:03,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:08:04,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:08:04,478][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:08:04,803][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:08:05,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:08:05,452][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:08:05,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:08:06,101][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:08:06,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:08:06,751][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:08:07,075][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:08:07,398][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:08:07,722][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:08:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:08:08,371][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:08:08,695][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:08:09,019][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:08:09,346][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:08:09,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:08:09,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:08:10,314][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:08:10,643][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:08:10,962][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:08:11,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:08:12,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:08:12,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:08:12,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:08:12,753][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:08:13,755][__main__][INFO] - Iteration 332 took 24s (39.61% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 58m 46s. Estimated total time: 20h 3m 11s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 6s, 500 more iterations: 3h 20m 31s. [2025-11-13 10:08:13,757][__main__][INFO] - Starting iteration 332. [2025-11-13 10:08:13,760][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:08:13,761][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:08:22,624][__main__][INFO] - Number of regex retries in iteration 332: 0 [2025-11-13 10:08:22,624][__main__][INFO] - agents played in iteration 332 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:08:23,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:23,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:23,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:23,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:23,188][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:08:23,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:08:23,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:08:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:08:24,605][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:08:24,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:08:25,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:08:25,583][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:08:25,910][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:08:26,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:08:26,559][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:08:26,887][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:08:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:08:27,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:08:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:08:28,188][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:08:28,514][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:08:28,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:08:29,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:08:29,490][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:08:29,816][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:08:30,141][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:08:30,465][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:08:30,789][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:08:31,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:08:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:08:31,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:08:32,094][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:08:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:08:32,755][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:08:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:08:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:08:33,739][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:08:34,062][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:08:34,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:08:35,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:08:35,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:08:35,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:08:35,872][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:08:36,861][__main__][INFO] - Iteration 333 took 23s (38.37% Gen, 57.34% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 10m 18s. Estimated total time: 19h 15m 6s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 31s. [2025-11-13 10:08:36,864][__main__][INFO] - Starting iteration 333. [2025-11-13 10:08:36,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:08:36,867][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:08:46,378][__main__][INFO] - Number of regex retries in iteration 333: 0 [2025-11-13 10:08:46,379][__main__][INFO] - agents played in iteration 333 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:08:46,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:46,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:46,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:46,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:46,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:08:46,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:08:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:08:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:08:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:08:48,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:08:49,002][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:08:49,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:08:49,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:08:49,976][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:08:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:08:50,633][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:08:50,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:08:51,283][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:08:51,608][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:08:51,938][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:08:52,264][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:08:52,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:08:52,914][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:08:53,242][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:08:53,567][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:08:53,891][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:08:54,215][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:08:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:08:54,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:08:55,189][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:08:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:08:55,841][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:08:56,167][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:08:56,491][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:08:56,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:08:57,142][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:08:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:08:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:08:58,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:08:58,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:08:59,569][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:08:59,571][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:08:59,573][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:09:00,546][__main__][INFO] - Iteration 334 took 23s (40.17% Gen, 55.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 47s. Estimated total time: 19h 43m 59s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 19s. [2025-11-13 10:09:00,548][__main__][INFO] - Starting iteration 334. [2025-11-13 10:09:00,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:09:00,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:09:10,089][__main__][INFO] - Number of regex retries in iteration 334: 0 [2025-11-13 10:09:10,090][__main__][INFO] - agents played in iteration 334 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:09:10,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:10,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:10,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:10,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:10,661][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:09:10,662][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:09:11,423][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:09:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:09:12,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:09:12,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:09:12,699][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:09:13,025][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:09:13,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:09:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:09:13,997][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:09:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:09:14,656][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:09:14,980][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:09:15,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:09:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:09:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:09:16,286][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:09:16,610][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:09:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:09:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:09:17,583][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:09:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:09:18,232][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:09:18,556][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:09:18,880][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:09:19,204][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:09:19,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:09:19,853][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:09:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:09:20,501][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:09:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:09:21,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:09:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:09:21,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:09:22,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:09:23,230][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:09:23,232][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:09:23,234][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:09:24,338][__main__][INFO] - Iteration 335 took 23s (40.09% Gen, 55.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 48s. Estimated total time: 19h 49m 23s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 13s. [2025-11-13 10:09:24,341][__main__][INFO] - Starting iteration 335. [2025-11-13 10:09:24,345][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:09:24,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:09:34,193][__main__][INFO] - Number of regex retries in iteration 335: 0 [2025-11-13 10:09:34,194][__main__][INFO] - agents played in iteration 335 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:09:34,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:34,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:34,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:34,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:34,761][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:09:34,761][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:09:35,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:09:35,898][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:09:36,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:09:36,553][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:09:36,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:09:37,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:09:37,543][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:09:37,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:09:38,195][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:09:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:09:38,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:09:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:09:39,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:09:39,823][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:09:40,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:09:40,475][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:09:40,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:09:41,124][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:09:41,450][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:09:41,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:09:42,103][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:09:42,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:09:42,752][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:09:43,076][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:09:43,399][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:09:43,723][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:09:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:09:44,371][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:09:44,695][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:09:45,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:09:45,344][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:09:45,670][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:09:45,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:09:46,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:09:47,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:09:47,454][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:09:47,455][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:09:48,451][__main__][INFO] - Iteration 336 took 24s (40.85% Gen, 55.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 20s. Estimated total time: 20h 5m 19s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 10s, 500 more iterations: 3h 20m 53s. [2025-11-13 10:09:48,453][__main__][INFO] - Starting iteration 336. [2025-11-13 10:09:48,456][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:09:48,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:09:57,786][__main__][INFO] - Number of regex retries in iteration 336: 0 [2025-11-13 10:09:57,787][__main__][INFO] - agents played in iteration 336 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:09:58,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:58,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:58,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:58,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:09:58,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:09:58,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:09:59,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:09:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:09:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:10:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:10:00,436][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:10:00,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:10:01,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:10:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:10:01,744][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:10:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:10:02,400][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:10:02,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:10:03,053][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:10:03,381][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:10:03,707][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:10:04,034][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:10:04,366][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:10:04,702][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:10:05,031][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:10:05,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:10:05,695][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:10:06,023][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:10:06,347][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:10:06,673][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:10:07,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:10:07,337][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:10:07,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:10:07,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:10:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:10:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:10:08,961][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:10:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:10:09,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:10:10,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:10:11,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:10:11,064][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:10:11,066][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:10:12,072][__main__][INFO] - Iteration 337 took 23s (39.50% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 25s. Estimated total time: 19h 40m 48s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 48s. [2025-11-13 10:10:12,074][__main__][INFO] - Starting iteration 337. [2025-11-13 10:10:12,077][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:10:12,078][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:10:21,461][__main__][INFO] - Number of regex retries in iteration 337: 0 [2025-11-13 10:10:21,462][__main__][INFO] - agents played in iteration 337 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:10:21,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:21,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:21,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:22,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:22,029][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:10:22,029][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:10:22,808][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:10:23,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:10:23,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:10:23,755][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:10:24,086][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:10:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:10:24,735][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:10:25,060][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:10:25,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:10:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:10:26,038][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:10:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:10:26,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:10:27,013][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:10:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:10:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:10:27,992][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:10:28,317][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:10:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:10:28,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:10:29,292][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:10:29,615][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:10:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:10:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:10:30,589][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:10:30,913][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:10:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:10:31,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:10:31,886][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:10:32,210][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:10:32,534][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:10:32,858][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:10:33,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:10:33,872][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:10:34,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:10:34,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:10:34,619][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:10:35,587][__main__][INFO] - Iteration 338 took 23s (39.92% Gen, 55.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 47s. Estimated total time: 19h 35m 34s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 55s. [2025-11-13 10:10:35,590][__main__][INFO] - Starting iteration 338. [2025-11-13 10:10:35,593][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:10:35,594][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:10:44,609][__main__][INFO] - Number of regex retries in iteration 338: 0 [2025-11-13 10:10:44,610][__main__][INFO] - agents played in iteration 338 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:10:45,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:45,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:45,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:45,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:10:45,211][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:10:45,212][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:10:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:10:46,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:10:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:10:46,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:10:47,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:10:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:10:47,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:10:48,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:10:48,552][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:10:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:10:49,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:10:49,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:10:49,852][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:10:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:10:50,500][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:10:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:10:51,154][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:10:51,481][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:10:51,807][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:10:52,133][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:10:52,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:10:52,783][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:10:53,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:10:53,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:10:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:10:54,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:10:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:10:54,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:10:55,055][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:10:55,379][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:10:55,703][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:10:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:10:56,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:10:57,055][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:10:57,783][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:10:57,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:10:57,786][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:10:58,773][__main__][INFO] - Iteration 339 took 23s (38.90% Gen, 56.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 52s. Estimated total time: 19h 19m 2s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 10s. [2025-11-13 10:10:58,775][__main__][INFO] - Starting iteration 339. [2025-11-13 10:10:58,778][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:10:58,779][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:11:07,759][__main__][INFO] - Number of regex retries in iteration 339: 0 [2025-11-13 10:11:07,760][__main__][INFO] - agents played in iteration 339 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:11:08,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:08,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:08,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:08,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:08,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:11:08,690][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:11:09,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:11:09,778][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:11:10,104][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:11:10,429][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:11:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:11:11,078][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:11:11,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:11:11,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:11:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:11:12,378][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:11:12,707][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:11:13,031][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:11:13,356][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:11:13,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:11:14,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:11:14,335][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:11:14,660][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:11:14,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:11:15,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:11:15,650][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:11:15,978][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:11:16,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:11:16,633][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:11:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:11:17,292][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:11:17,622][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:11:17,950][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:11:18,271][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:11:18,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:11:18,927][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:11:19,260][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:11:19,577][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:11:19,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:11:20,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:11:21,340][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:11:21,342][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:11:21,343][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:11:22,361][__main__][INFO] - Iteration 340 took 23s (38.08% Gen, 57.60% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 31m 36s. Estimated total time: 19h 39m 10s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 31s. [2025-11-13 10:11:22,363][__main__][INFO] - Starting iteration 340. [2025-11-13 10:11:22,366][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 10:11:22,366][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:11:31,398][__main__][INFO] - Number of regex retries in iteration 340: 0 [2025-11-13 10:11:31,398][__main__][INFO] - agents played in iteration 340 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:11:31,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:31,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:31,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:31,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:31,962][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:11:31,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:11:32,743][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:11:33,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:11:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:11:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:11:34,012][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:11:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:11:34,662][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:11:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:11:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:11:35,637][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:11:35,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:11:36,289][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:11:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:11:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:11:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:11:37,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:11:37,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:11:38,258][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:11:38,584][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:11:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:11:39,244][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:11:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:11:39,898][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:11:40,227][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:11:40,557][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:11:40,883][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:11:41,214][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:11:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:11:41,865][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:11:42,188][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:11:42,512][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:11:42,835][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:11:43,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:11:43,844][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:11:44,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:11:44,593][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:11:44,594][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:11:46,581][__main__][INFO] - Iteration 341 took 24s (37.30% Gen, 54.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 48s. Estimated total time: 20h 10m 46s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 21s, 500 more iterations: 3h 21m 47s. [2025-11-13 10:11:46,583][__main__][INFO] - Starting iteration 341. [2025-11-13 10:11:46,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:11:46,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:11:55,816][__main__][INFO] - Number of regex retries in iteration 341: 0 [2025-11-13 10:11:55,817][__main__][INFO] - agents played in iteration 341 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:11:56,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:56,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:56,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:56,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:11:56,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:11:56,399][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:11:57,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:11:57,460][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:11:57,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:11:58,110][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:11:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:11:58,762][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:11:59,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:11:59,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:11:59,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:12:00,066][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:12:00,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:12:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:12:01,041][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:12:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:12:01,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:12:02,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:12:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:12:02,686][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:12:03,012][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:12:03,339][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:12:03,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:12:03,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:12:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:12:04,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:12:04,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:12:05,295][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:12:05,622][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:12:05,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:12:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:12:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:12:06,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:12:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:12:07,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:12:08,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:12:09,037][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:12:09,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:12:09,040][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:12:10,053][__main__][INFO] - Iteration 342 took 23s (39.33% Gen, 56.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 25m 4s. Estimated total time: 19h 33m 26s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 34s. [2025-11-13 10:12:10,056][__main__][INFO] - Starting iteration 342. [2025-11-13 10:12:10,058][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:12:10,059][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:12:19,370][__main__][INFO] - Number of regex retries in iteration 342: 0 [2025-11-13 10:12:19,371][__main__][INFO] - agents played in iteration 342 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:12:19,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:19,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:19,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:19,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:19,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:12:19,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:12:20,726][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:12:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:12:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:12:21,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:12:22,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:12:22,328][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:12:22,655][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:12:22,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:12:23,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:12:23,637][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:12:23,964][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:12:24,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:12:24,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:12:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:12:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:12:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:12:25,920][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:12:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:12:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:12:26,896][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:12:27,220][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:12:27,545][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:12:27,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:12:28,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:12:28,521][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:12:28,848][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:12:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:12:29,504][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:12:29,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:12:30,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:12:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:12:30,811][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:12:31,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:12:31,841][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:12:32,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:12:32,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:12:32,569][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:12:33,565][__main__][INFO] - Iteration 343 took 23s (39.61% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 26m 38s. Estimated total time: 19h 35m 23s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 53s. [2025-11-13 10:12:33,567][__main__][INFO] - Starting iteration 343. [2025-11-13 10:12:33,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:12:33,571][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:12:42,521][__main__][INFO] - Number of regex retries in iteration 343: 0 [2025-11-13 10:12:42,522][__main__][INFO] - agents played in iteration 343 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:12:42,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:43,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:43,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:43,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:43,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:12:43,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:12:43,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:12:44,178][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:12:44,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:12:44,827][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:12:45,151][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:12:45,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:12:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:12:46,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:12:46,457][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:12:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:12:47,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:12:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:12:47,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:12:48,086][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:12:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:12:48,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:12:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:12:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:12:49,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:12:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:12:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:12:50,694][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:12:51,024][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:12:51,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:12:51,675][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:12:52,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:12:52,329][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:12:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:12:52,979][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:12:53,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:12:53,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:12:53,957][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:12:54,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:12:54,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:12:55,728][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:12:55,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:12:55,731][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:12:56,712][__main__][INFO] - Iteration 344 took 23s (38.67% Gen, 57.08% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 7m 59s. Estimated total time: 19h 17m 7s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 51s. [2025-11-13 10:12:56,715][__main__][INFO] - Starting iteration 344. [2025-11-13 10:12:56,718][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:12:56,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:13:05,454][__main__][INFO] - Number of regex retries in iteration 344: 0 [2025-11-13 10:13:05,454][__main__][INFO] - agents played in iteration 344 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:13:05,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:05,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:05,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:06,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:06,032][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:13:06,032][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:13:06,813][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:13:07,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:13:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:13:07,761][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:13:08,086][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:13:08,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:13:08,736][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:13:09,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:13:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:13:09,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:13:10,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:13:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:13:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:13:11,016][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:13:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:13:11,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:13:11,993][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:13:12,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:13:12,646][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:13:12,972][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:13:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:13:13,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:13:13,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:13:14,274][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:13:14,603][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:13:14,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:13:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:13:15,590][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:13:15,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:13:16,243][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:13:16,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:13:16,895][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:13:17,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:13:17,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:13:18,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:13:18,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:13:18,648][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:13:19,629][__main__][INFO] - Iteration 345 took 22s (38.12% Gen, 57.59% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 56m 4s. Estimated total time: 19h 5m 35s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 55s. [2025-11-13 10:13:19,631][__main__][INFO] - Starting iteration 345. [2025-11-13 10:13:19,634][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:13:19,635][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:13:29,148][__main__][INFO] - Number of regex retries in iteration 345: 0 [2025-11-13 10:13:29,149][__main__][INFO] - agents played in iteration 345 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:13:29,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:29,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:29,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:29,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:29,737][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:13:29,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:13:30,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:13:30,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:13:31,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:13:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:13:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:13:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:13:32,440][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:13:32,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:13:33,092][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:13:33,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:13:33,743][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:13:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:13:34,403][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:13:34,729][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:13:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:13:35,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:13:35,703][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:13:36,028][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:13:36,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:13:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:13:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:13:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:13:37,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:13:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:13:38,302][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:13:38,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:13:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:13:39,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:13:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:13:39,930][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:13:40,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:13:40,582][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:13:40,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:13:41,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:13:42,359][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:13:42,360][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:13:42,362][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:13:43,357][__main__][INFO] - Iteration 346 took 23s (40.10% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 16s. Estimated total time: 19h 46m 11s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 41s. [2025-11-13 10:13:43,359][__main__][INFO] - Starting iteration 346. [2025-11-13 10:13:43,363][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:13:43,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:13:52,242][__main__][INFO] - Number of regex retries in iteration 346: 0 [2025-11-13 10:13:52,242][__main__][INFO] - agents played in iteration 346 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:13:52,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:52,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:52,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:52,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:13:52,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:13:52,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:13:53,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:13:54,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:13:54,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:13:54,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:13:55,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:13:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:13:55,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:13:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:13:56,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:13:56,859][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:13:57,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:13:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:13:57,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:13:58,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:13:58,486][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:13:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:13:59,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:13:59,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:13:59,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:14:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:14:00,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:14:00,764][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:14:01,091][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:14:01,416][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:14:01,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:14:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:14:02,397][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:14:02,723][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:14:03,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:14:03,374][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:14:03,700][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:14:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:14:04,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:14:05,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:14:05,818][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:14:05,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:14:05,821][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:14:06,830][__main__][INFO] - Iteration 347 took 23s (37.83% Gen, 57.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 23m 5s. Estimated total time: 19h 33m 23s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s. [2025-11-13 10:14:06,832][__main__][INFO] - Starting iteration 347. [2025-11-13 10:14:06,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:14:06,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:14:15,743][__main__][INFO] - Number of regex retries in iteration 347: 0 [2025-11-13 10:14:15,744][__main__][INFO] - agents played in iteration 347 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:14:16,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:16,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:16,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:16,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:16,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:14:16,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:14:17,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:14:17,415][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:14:17,730][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:14:18,054][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:14:18,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:14:18,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:14:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:14:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:14:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:14:20,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:14:20,332][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:14:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:14:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:14:21,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:14:21,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:14:21,958][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:14:22,283][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:14:22,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:14:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:14:23,257][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:14:23,581][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:14:23,916][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:14:24,235][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:14:24,559][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:14:24,884][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:14:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:14:25,534][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:14:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:14:26,185][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:14:26,516][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:14:26,836][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:14:27,161][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:14:27,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:14:28,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:14:28,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:14:28,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:14:28,917][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:14:29,893][__main__][INFO] - Iteration 348 took 23s (38.63% Gen, 57.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 2m 14s. Estimated total time: 19h 12m 55s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 9s. [2025-11-13 10:14:29,895][__main__][INFO] - Starting iteration 348. [2025-11-13 10:14:29,900][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:14:29,900][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:14:39,140][__main__][INFO] - Number of regex retries in iteration 348: 0 [2025-11-13 10:14:39,140][__main__][INFO] - agents played in iteration 348 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:14:39,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:39,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:39,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:39,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:14:39,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:14:39,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:14:40,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:14:40,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:14:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:14:41,458][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:14:41,783][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:14:42,107][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:14:42,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:14:42,758][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:14:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:14:43,408][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:14:43,733][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:14:44,058][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:14:44,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:14:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:14:45,036][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:14:45,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:14:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:14:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:14:46,337][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:14:46,665][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:14:46,992][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:14:47,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:14:47,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:14:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:14:48,296][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:14:48,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:14:48,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:14:49,271][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:14:49,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:14:49,922][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:14:50,248][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:14:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:14:50,900][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:14:51,633][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:14:52,392][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:14:52,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:14:52,395][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:14:53,371][__main__][INFO] - Iteration 349 took 23s (39.36% Gen, 56.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 36s. Estimated total time: 19h 33m 40s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 36s. [2025-11-13 10:14:53,373][__main__][INFO] - Starting iteration 349. [2025-11-13 10:14:53,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:14:53,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:15:02,187][__main__][INFO] - Number of regex retries in iteration 349: 0 [2025-11-13 10:15:02,187][__main__][INFO] - agents played in iteration 349 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:15:02,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:02,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:02,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:02,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:02,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:15:02,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:15:03,546][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:15:03,843][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:15:04,168][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:15:04,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:15:04,818][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:15:05,143][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:15:05,468][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:15:05,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:15:06,122][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:15:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:15:06,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:15:07,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:15:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:15:07,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:15:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:15:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:15:08,720][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:15:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:15:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:15:09,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:15:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:15:10,353][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:15:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:15:11,007][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:15:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:15:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:15:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:15:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:15:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:15:12,961][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:15:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:15:13,617][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:15:13,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:15:14,685][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:15:15,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:15:15,444][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:15:15,446][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:15:16,423][__main__][INFO] - Iteration 350 took 23s (38.22% Gen, 57.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 0m 54s. Estimated total time: 19h 12m 21s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 3s. [2025-11-13 10:15:16,425][__main__][INFO] - Starting iteration 350. [2025-11-13 10:15:16,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. [2025-11-13 10:15:16,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:15:25,478][__main__][INFO] - Number of regex retries in iteration 350: 0 [2025-11-13 10:15:25,478][__main__][INFO] - agents played in iteration 350 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:15:25,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:25,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:26,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:26,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:26,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:15:26,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:15:26,849][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:15:27,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:15:27,471][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:15:27,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:15:28,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:15:28,445][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:15:28,770][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:15:29,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:15:29,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:15:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:15:30,071][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:15:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:15:30,720][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:15:31,044][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:15:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:15:31,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:15:32,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:15:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:15:32,672][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:15:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:15:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:15:33,649][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:15:33,974][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:15:34,297][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:15:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:15:34,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:15:35,271][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:15:35,597][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:15:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:15:36,248][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:15:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:15:36,900][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:15:37,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:15:37,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:15:38,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:15:38,707][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:15:38,709][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:15:40,697][__main__][INFO] - Iteration 351 took 24s (37.04% Gen, 54.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 1m 34s. Estimated total time: 20h 13m 26s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 26s, 500 more iterations: 3h 22m 14s. [2025-11-13 10:15:40,725][__main__][INFO] - Starting iteration 351. [2025-11-13 10:15:40,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:15:40,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:15:49,417][__main__][INFO] - Number of regex retries in iteration 351: 0 [2025-11-13 10:15:49,418][__main__][INFO] - agents played in iteration 351 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:15:49,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:49,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:49,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:50,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:15:50,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:15:50,009][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:15:50,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:15:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:15:51,432][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:15:51,757][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:15:52,082][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:15:52,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:15:52,738][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:15:53,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:15:53,388][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:15:53,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:15:54,038][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:15:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:15:54,688][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:15:55,016][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:15:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:15:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:15:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:15:56,318][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:15:56,644][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:15:56,969][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:15:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:15:57,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:15:57,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:15:58,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:15:58,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:15:58,932][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:15:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:15:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:15:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:16:00,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:16:00,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:16:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:16:01,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:16:01,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:16:02,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:16:02,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:16:02,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:16:03,667][__main__][INFO] - Iteration 352 took 22s (37.87% Gen, 57.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 54m 43s. Estimated total time: 19h 6m 58s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 9s. [2025-11-13 10:16:03,669][__main__][INFO] - Starting iteration 352. [2025-11-13 10:16:03,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:16:03,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:16:12,509][__main__][INFO] - Number of regex retries in iteration 352: 0 [2025-11-13 10:16:12,510][__main__][INFO] - agents played in iteration 352 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:16:12,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:13,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:13,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:13,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:13,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:16:13,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:16:13,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:16:14,171][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:16:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:16:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:16:15,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:16:15,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:16:15,800][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:16:16,126][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:16:16,452][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:16:16,776][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:16:17,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:16:17,427][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:16:17,752][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:16:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:16:18,401][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:16:18,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:16:19,051][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:16:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:16:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:16:20,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:16:20,353][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:16:20,681][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:16:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:16:21,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:16:21,659][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:16:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:16:22,311][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:16:22,637][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:16:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:16:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:16:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:16:23,937][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:16:24,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:16:25,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:16:25,786][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:16:25,789][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:16:25,791][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:16:26,773][__main__][INFO] - Iteration 353 took 23s (38.25% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 2m 29s. Estimated total time: 19h 15m 7s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 31s. [2025-11-13 10:16:26,776][__main__][INFO] - Starting iteration 353. [2025-11-13 10:16:26,779][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:16:26,780][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:16:35,421][__main__][INFO] - Number of regex retries in iteration 353: 0 [2025-11-13 10:16:35,422][__main__][INFO] - agents played in iteration 353 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:16:35,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:35,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:35,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:35,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:35,990][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:16:35,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:16:37,134][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:16:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:16:37,757][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:16:38,081][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:16:38,407][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:16:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:16:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:16:39,383][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:16:39,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:16:40,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:16:40,358][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:16:40,682][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:16:41,007][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:16:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:16:41,658][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:16:41,982][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:16:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:16:42,632][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:16:42,957][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:16:43,283][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:16:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:16:43,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:16:44,269][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:16:44,595][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:16:44,920][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:16:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:16:45,573][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:16:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:16:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:16:46,548][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:16:46,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:16:47,198][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:16:47,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:16:48,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:16:48,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:16:48,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:16:48,990][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:16:49,968][__main__][INFO] - Iteration 354 took 23s (37.26% Gen, 58.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 6m 27s. Estimated total time: 19h 19m 28s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 14s. [2025-11-13 10:16:49,970][__main__][INFO] - Starting iteration 354. [2025-11-13 10:16:49,973][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:16:49,974][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:16:59,393][__main__][INFO] - Number of regex retries in iteration 354: 0 [2025-11-13 10:16:59,393][__main__][INFO] - agents played in iteration 354 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:16:59,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:59,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:59,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:59,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:59,965][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:16:59,966][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:17:00,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:17:01,059][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:17:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:17:01,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:17:02,040][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:17:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:17:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:17:03,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:17:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:17:03,666][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:17:03,992][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:17:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:17:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:17:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:17:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:17:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:17:05,949][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:17:06,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:17:06,594][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:17:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:17:07,251][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:17:07,569][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:17:07,896][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:17:08,226][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:17:08,556][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:17:08,877][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:17:09,202][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:17:09,533][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:17:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:17:10,188][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:17:10,513][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:17:10,837][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:17:11,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:17:11,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:17:12,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:17:12,684][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:17:12,686][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:17:13,655][__main__][INFO] - Iteration 355 took 23s (39.77% Gen, 56.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 44s. Estimated total time: 19h 44m 9s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 21s. [2025-11-13 10:17:13,657][__main__][INFO] - Starting iteration 355. [2025-11-13 10:17:13,660][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:17:13,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:17:23,312][__main__][INFO] - Number of regex retries in iteration 355: 0 [2025-11-13 10:17:23,313][__main__][INFO] - agents played in iteration 355 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:17:23,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:23,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:23,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:23,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:23,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:17:23,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:17:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:17:24,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:17:25,294][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:17:25,618][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:17:25,945][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:17:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:17:26,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:17:26,919][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:17:27,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:17:27,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:17:27,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:17:28,217][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:17:28,541][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:17:28,867][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:17:29,192][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:17:29,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:17:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:17:30,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:17:30,492][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:17:30,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:17:31,143][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:17:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:17:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:17:32,121][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:17:32,446][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:17:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:17:33,095][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:17:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:17:33,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:17:34,071][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:17:34,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:17:34,724][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:17:35,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:17:35,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:17:36,588][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:17:36,590][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:17:36,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:17:37,565][__main__][INFO] - Iteration 356 took 23s (40.37% Gen, 55.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 29s. Estimated total time: 19h 55m 17s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 12s. [2025-11-13 10:17:37,568][__main__][INFO] - Starting iteration 356. [2025-11-13 10:17:37,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:17:37,572][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:17:46,986][__main__][INFO] - Number of regex retries in iteration 356: 0 [2025-11-13 10:17:46,987][__main__][INFO] - agents played in iteration 356 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:17:47,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:47,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:47,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:47,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:17:47,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:17:47,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:17:48,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:17:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:17:48,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:17:49,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:17:49,603][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:17:49,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:17:50,252][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:17:50,576][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:17:50,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:17:51,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:17:51,550][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:17:51,875][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:17:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:17:52,524][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:17:52,848][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:17:53,174][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:17:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:17:53,824][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:17:54,148][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:17:54,473][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:17:54,799][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:17:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:17:55,447][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:17:55,773][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:17:56,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:17:56,422][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:17:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:17:57,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:17:57,399][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:17:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:17:58,048][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:17:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:17:58,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:17:59,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:18:00,212][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:18:00,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:18:00,216][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:18:01,208][__main__][INFO] - Iteration 357 took 23s (39.83% Gen, 55.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 40s. Estimated total time: 19h 41m 52s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 58s. [2025-11-13 10:18:01,210][__main__][INFO] - Starting iteration 357. [2025-11-13 10:18:01,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:18:01,214][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:18:11,138][__main__][INFO] - Number of regex retries in iteration 357: 0 [2025-11-13 10:18:11,139][__main__][INFO] - agents played in iteration 357 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:18:11,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:11,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:11,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:11,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:11,730][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:18:11,731][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:18:12,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:18:12,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:18:13,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:18:13,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:18:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:18:14,113][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:18:14,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:18:14,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:18:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:18:15,414][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:18:15,744][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:18:16,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:18:16,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:18:16,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:18:17,045][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:18:17,369][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:18:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:18:18,020][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:18:18,345][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:18:18,672][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:18:18,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:18:19,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:18:19,655][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:18:19,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:18:20,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:18:20,632][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:18:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:18:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:18:21,610][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:18:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:18:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:18:22,590][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:18:22,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:18:23,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:18:24,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:18:24,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:18:24,410][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:18:25,441][__main__][INFO] - Iteration 358 took 24s (40.96% Gen, 54.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 56m 49s. Estimated total time: 20h 11m 25s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 54s. [2025-11-13 10:18:25,443][__main__][INFO] - Starting iteration 358. [2025-11-13 10:18:25,447][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:18:25,447][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:18:34,995][__main__][INFO] - Number of regex retries in iteration 358: 0 [2025-11-13 10:18:34,996][__main__][INFO] - agents played in iteration 358 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:18:35,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:35,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:35,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:35,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:35,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:18:35,575][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:18:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:18:36,649][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:18:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:18:37,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:18:37,626][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:18:37,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:18:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:18:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:18:38,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:18:39,251][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:18:39,575][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:18:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:18:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:18:40,549][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:18:40,874][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:18:41,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:18:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:18:41,853][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:18:42,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:18:42,507][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:18:42,832][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:18:43,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:18:43,492][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:18:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:18:44,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:18:44,471][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:18:44,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:18:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:18:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:18:45,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:18:46,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:18:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:18:46,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:18:47,494][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:18:48,239][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:18:48,241][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:18:48,242][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:18:49,284][__main__][INFO] - Iteration 359 took 23s (40.05% Gen, 55.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 55s. Estimated total time: 19h 51m 55s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 39s. [2025-11-13 10:18:49,287][__main__][INFO] - Starting iteration 359. [2025-11-13 10:18:49,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:18:49,291][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:18:58,649][__main__][INFO] - Number of regex retries in iteration 359: 0 [2025-11-13 10:18:58,650][__main__][INFO] - agents played in iteration 359 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:18:59,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,231][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:18:59,231][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:19:00,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:00,318][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:00,653][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:00,978][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:01,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:02,285][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:03,260][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:03,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:04,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:04,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:04,893][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:05,219][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:05,874][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:06,526][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:19:06,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:07,177][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:07,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:07,830][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:08,156][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:08,480][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:09,463][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:09,784][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:10,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:19:11,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:19:11,932][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:19:11,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:19:11,935][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:19:12,995][__main__][INFO] - Iteration 360 took 23s (39.48% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 53s. Estimated total time: 19h 45m 17s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 32s. [2025-11-13 10:19:12,997][__main__][INFO] - Starting iteration 360. [2025-11-13 10:19:13,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. [2025-11-13 10:19:13,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:19:21,988][__main__][INFO] - Number of regex retries in iteration 360: 0 [2025-11-13 10:19:21,989][__main__][INFO] - agents played in iteration 360 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:19:22,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:22,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:22,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:22,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:22,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:19:22,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:19:23,708][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:24,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:25,308][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:25,633][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:26,283][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:26,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:27,258][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:27,583][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:27,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:28,235][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:28,560][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:28,885][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:29,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:29,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:29,860][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:30,184][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:19:30,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:31,485][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:31,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:32,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:33,117][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:34,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:19:34,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:19:35,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:19:35,583][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:19:35,585][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:19:37,721][__main__][INFO] - Iteration 361 took 24s (36.36% Gen, 55.00% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 20m 19s. Estimated total time: 20h 36m 8s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 12s, 500 more iterations: 3h 26m 1s. [2025-11-13 10:19:37,724][__main__][INFO] - Starting iteration 361. [2025-11-13 10:19:37,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:19:37,728][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:19:47,732][__main__][INFO] - Number of regex retries in iteration 361: 0 [2025-11-13 10:19:47,733][__main__][INFO] - agents played in iteration 361 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:19:48,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:48,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:48,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:48,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:48,305][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:19:48,305][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:19:49,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:49,399][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:49,725][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:50,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:50,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:51,030][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:51,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:51,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:52,333][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:52,658][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:52,984][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:53,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:53,635][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:53,959][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:54,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:54,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:19:55,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:56,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:56,891][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:57,219][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:57,544][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:57,875][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:58,201][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:58,525][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:59,177][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:59,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:20:00,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:20:01,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:20:01,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:20:01,045][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:20:02,077][__main__][INFO] - Iteration 362 took 24s (41.08% Gen, 54.67% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 1m 16s. Estimated total time: 20h 17m 29s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 34s, 500 more iterations: 3h 22m 54s. [2025-11-13 10:20:02,079][__main__][INFO] - Starting iteration 362. [2025-11-13 10:20:02,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:20:02,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:20:11,506][__main__][INFO] - Number of regex retries in iteration 362: 0 [2025-11-13 10:20:11,507][__main__][INFO] - agents played in iteration 362 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:20:11,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:12,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:12,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:12,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:12,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:20:12,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:20:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:20:13,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:20:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:20:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:20:14,183][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:20:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:20:14,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:20:15,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:20:15,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:20:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:20:16,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:20:16,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:20:16,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:20:17,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:20:17,436][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:20:17,760][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:20:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:20:18,410][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:20:18,734][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:20:19,058][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:20:19,388][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:20:19,715][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:20:20,042][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:20:20,369][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:20:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:20:21,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:20:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:20:21,682][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:20:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:20:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:20:22,655][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:20:22,984][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:20:23,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:20:24,064][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:20:24,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:20:24,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:20:24,839][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:20:25,827][__main__][INFO] - Iteration 363 took 23s (39.68% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 34s. Estimated total time: 19h 47m 11s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 51s. [2025-11-13 10:20:25,829][__main__][INFO] - Starting iteration 363. [2025-11-13 10:20:25,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:20:25,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:20:34,964][__main__][INFO] - Number of regex retries in iteration 363: 0 [2025-11-13 10:20:34,964][__main__][INFO] - agents played in iteration 363 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:20:35,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:35,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:35,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:35,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:35,541][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:20:35,541][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:20:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:20:36,633][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:20:36,959][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:20:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:20:37,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:20:37,936][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:20:38,261][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:20:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:20:38,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:20:39,235][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:20:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:20:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:20:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:20:40,536][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:20:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:20:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:20:41,510][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:20:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:20:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:20:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:20:42,819][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:20:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:20:43,470][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:20:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:20:44,124][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:20:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:20:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:20:45,095][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:20:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:20:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:20:46,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:20:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:20:46,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:20:47,467][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:20:48,230][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:20:48,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:20:48,233][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:20:49,227][__main__][INFO] - Iteration 364 took 23s (39.03% Gen, 56.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 47s. Estimated total time: 19h 29m 47s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 57s. [2025-11-13 10:20:49,230][__main__][INFO] - Starting iteration 364. [2025-11-13 10:20:49,233][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:20:49,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:20:58,813][__main__][INFO] - Number of regex retries in iteration 364: 0 [2025-11-13 10:20:58,813][__main__][INFO] - agents played in iteration 364 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:20:59,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:59,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:59,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:59,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:59,384][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:20:59,385][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:21:00,176][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:21:00,473][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:21:00,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:21:01,124][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:21:01,448][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:21:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:21:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:21:02,423][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:21:02,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:21:03,073][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:21:03,398][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:21:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:21:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:21:04,372][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:21:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:21:05,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:21:05,346][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:21:05,671][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:21:05,998][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:21:06,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:21:06,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:21:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:21:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:21:07,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:21:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:21:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:21:08,625][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:21:08,950][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:21:09,279][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:21:09,606][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:21:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:21:10,263][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:21:10,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:21:11,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:21:12,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:21:12,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:21:12,095][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:21:13,072][__main__][INFO] - Iteration 365 took 23s (40.18% Gen, 55.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 35s. Estimated total time: 19h 51m 59s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 39s. [2025-11-13 10:21:13,075][__main__][INFO] - Starting iteration 365. [2025-11-13 10:21:13,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:21:13,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:21:22,399][__main__][INFO] - Number of regex retries in iteration 365: 0 [2025-11-13 10:21:22,400][__main__][INFO] - agents played in iteration 365 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:21:22,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:22,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:22,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:22,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:22,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:21:22,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:21:23,753][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:21:24,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:21:24,375][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:21:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:21:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:21:25,349][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:21:25,674][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:21:25,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:21:26,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:21:26,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:21:26,974][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:21:27,298][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:21:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:21:27,947][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:21:28,272][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:21:28,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:21:28,921][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:21:29,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:21:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:21:29,897][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:21:30,222][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:21:30,549][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:21:30,877][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:21:31,202][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:21:31,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:21:31,852][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:21:32,178][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:21:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:21:32,830][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:21:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:21:33,480][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:21:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:21:34,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:21:34,863][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:21:35,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:21:35,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:21:35,628][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:21:36,609][__main__][INFO] - Iteration 366 took 23s (39.61% Gen, 56.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 48s. Estimated total time: 19h 36m 36s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 13s, 500 more iterations: 3h 16m 6s. [2025-11-13 10:21:36,612][__main__][INFO] - Starting iteration 366. [2025-11-13 10:21:36,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:21:36,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:21:45,746][__main__][INFO] - Number of regex retries in iteration 366: 0 [2025-11-13 10:21:45,747][__main__][INFO] - agents played in iteration 366 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:21:46,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:46,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:46,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:46,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:21:46,324][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:21:46,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:21:47,127][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:21:47,424][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:21:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:21:48,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:21:48,399][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:21:48,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:21:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:21:49,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:21:49,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:21:50,021][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:21:50,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:21:50,671][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:21:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:21:51,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:21:51,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:21:51,974][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:21:52,299][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:21:52,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:21:52,955][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:21:53,281][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:21:53,607][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:21:53,932][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:21:54,257][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:21:54,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:21:54,912][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:21:55,230][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:21:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:21:55,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:21:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:21:56,540][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:21:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:21:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:21:57,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:21:58,226][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:21:58,968][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:21:58,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:21:58,972][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:21:59,954][__main__][INFO] - Iteration 367 took 23s (39.12% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 8m 47s. Estimated total time: 19h 26m 58s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 29s. [2025-11-13 10:21:59,956][__main__][INFO] - Starting iteration 367. [2025-11-13 10:21:59,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:21:59,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:22:09,127][__main__][INFO] - Number of regex retries in iteration 367: 0 [2025-11-13 10:22:09,128][__main__][INFO] - agents played in iteration 367 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:22:09,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:09,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:10,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:10,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:10,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:22:10,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:22:10,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:22:11,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:22:11,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:22:11,800][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:22:12,125][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:22:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:22:12,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:22:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:22:13,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:22:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:22:14,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:22:14,402][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:22:14,727][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:22:15,052][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:22:15,377][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:22:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:22:16,027][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:22:16,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:22:16,677][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:22:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:22:17,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:22:17,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:22:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:22:18,307][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:22:18,633][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:22:18,956][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:22:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:22:19,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:22:19,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:22:20,261][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:22:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:22:20,915][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:22:21,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:22:21,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:22:22,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:22:22,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:22:22,683][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:22:23,703][__main__][INFO] - Iteration 368 took 23s (38.61% Gen, 57.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 38s. Estimated total time: 19h 47m 13s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 52s. [2025-11-13 10:22:23,709][__main__][INFO] - Starting iteration 368. [2025-11-13 10:22:23,731][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:22:23,731][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:22:33,066][__main__][INFO] - Number of regex retries in iteration 368: 0 [2025-11-13 10:22:33,067][__main__][INFO] - agents played in iteration 368 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:22:33,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:33,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:33,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:33,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:33,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:22:33,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:22:34,428][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:22:34,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:22:35,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:22:35,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:22:35,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:22:36,028][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:22:36,354][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:22:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:22:37,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:22:37,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:22:37,657][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:22:37,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:22:38,310][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:22:38,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:22:38,959][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:22:39,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:22:39,611][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:22:39,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:22:40,265][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:22:40,592][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:22:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:22:41,246][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:22:41,573][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:22:41,897][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:22:42,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:22:42,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:22:42,873][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:22:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:22:43,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:22:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:22:44,174][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:22:44,498][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:22:44,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:22:45,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:22:46,274][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:22:46,276][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:22:46,277][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:22:47,292][__main__][INFO] - Iteration 369 took 23s (39.59% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 0s. Estimated total time: 19h 38m 58s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 29s. [2025-11-13 10:22:47,294][__main__][INFO] - Starting iteration 369. [2025-11-13 10:22:47,297][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:22:47,298][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:22:55,688][__main__][INFO] - Number of regex retries in iteration 369: 0 [2025-11-13 10:22:55,689][__main__][INFO] - agents played in iteration 369 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:22:56,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:56,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:56,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:56,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:56,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:22:56,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:22:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:22:57,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:22:57,671][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:22:57,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:22:58,325][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:22:58,650][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:22:58,975][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:22:59,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:22:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:22:59,948][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:23:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:23:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:23:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:23:01,251][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:23:01,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:23:01,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:23:02,235][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:23:02,560][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:23:02,885][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:23:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:23:03,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:23:03,864][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:23:04,191][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:23:04,516][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:23:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:23:05,172][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:23:05,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:23:05,828][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:23:06,154][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:23:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:23:06,806][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:23:07,134][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:23:07,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:23:08,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:23:08,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:23:08,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:23:08,985][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:23:09,979][__main__][INFO] - Iteration 370 took 22s (36.99% Gen, 58.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 34m 47s. Estimated total time: 18h 54m 8s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 48s, 500 more iterations: 3h 9m 1s. [2025-11-13 10:23:09,981][__main__][INFO] - Starting iteration 370. [2025-11-13 10:23:09,985][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. [2025-11-13 10:23:09,985][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:23:19,615][__main__][INFO] - Number of regex retries in iteration 370: 0 [2025-11-13 10:23:19,616][__main__][INFO] - agents played in iteration 370 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:23:20,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:20,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:20,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:20,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:20,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:23:20,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:23:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:23:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:23:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:23:21,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:23:22,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:23:22,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:23:22,912][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:23:23,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:23:23,562][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:23:23,886][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:23:24,212][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:23:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:23:24,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:23:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:23:25,523][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:23:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:23:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:23:26,504][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:23:26,834][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:23:27,161][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:23:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:23:27,813][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:23:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:23:28,464][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:23:28,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:23:29,115][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:23:29,448][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:23:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:23:30,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:23:30,419][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:23:30,746][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:23:31,077][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:23:31,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:23:32,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:23:32,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:23:32,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:23:32,870][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:23:34,966][__main__][INFO] - Iteration 371 took 24s (38.55% Gen, 53.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 20s. Estimated total time: 20h 49m 6s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 38s, 500 more iterations: 3h 28m 11s. [2025-11-13 10:23:34,968][__main__][INFO] - Starting iteration 371. [2025-11-13 10:23:34,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:23:34,972][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:23:43,469][__main__][INFO] - Number of regex retries in iteration 371: 0 [2025-11-13 10:23:43,470][__main__][INFO] - agents played in iteration 371 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:23:43,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:43,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:44,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:44,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:44,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:23:44,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:23:44,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:23:45,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:23:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:23:45,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:23:46,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:23:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:23:46,748][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:23:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:23:47,403][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:23:47,736][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:23:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:23:48,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:23:48,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:23:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:23:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:23:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:23:50,008][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:23:50,341][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:23:50,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:23:50,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:23:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:23:51,636][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:23:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:23:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:23:52,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:23:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:23:53,274][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:23:53,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:23:53,929][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:23:54,260][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:23:54,587][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:23:54,913][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:23:55,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:23:56,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:23:56,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:23:56,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:23:56,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:23:57,713][__main__][INFO] - Iteration 372 took 22s (37.37% Gen, 58.35% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 36m 57s. Estimated total time: 18h 57m 6s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 54s, 500 more iterations: 3h 9m 31s. [2025-11-13 10:23:57,715][__main__][INFO] - Starting iteration 372. [2025-11-13 10:23:57,719][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:23:57,719][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:24:06,748][__main__][INFO] - Number of regex retries in iteration 372: 0 [2025-11-13 10:24:06,749][__main__][INFO] - agents played in iteration 372 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:24:07,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:07,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:07,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:07,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:07,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:24:07,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:24:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:24:08,404][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:24:08,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:24:09,056][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:24:09,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:24:09,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:24:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:24:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:24:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:24:11,007][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:24:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:24:11,657][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:24:11,982][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:24:12,307][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:24:12,641][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:24:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:24:13,283][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:24:13,609][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:24:13,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:24:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:24:14,585][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:24:14,910][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:24:15,234][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:24:15,563][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:24:15,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:24:16,212][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:24:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:24:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:24:17,188][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:24:17,512][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:24:17,837][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:24:18,161][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:24:18,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:24:19,260][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:24:20,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:24:20,009][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:24:20,011][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:24:21,006][__main__][INFO] - Iteration 373 took 23s (38.77% Gen, 56.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 52s. Estimated total time: 19h 24m 24s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 4s. [2025-11-13 10:24:21,008][__main__][INFO] - Starting iteration 373. [2025-11-13 10:24:21,012][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:24:21,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:24:30,645][__main__][INFO] - Number of regex retries in iteration 373: 0 [2025-11-13 10:24:30,646][__main__][INFO] - agents played in iteration 373 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:24:31,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:31,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:31,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:31,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:31,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:24:31,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:24:31,996][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:24:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:24:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:24:32,950][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:24:33,277][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:24:33,605][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:24:33,929][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:24:34,254][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:24:34,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:24:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:24:35,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:24:35,558][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:24:35,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:24:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:24:36,533][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:24:36,859][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:24:37,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:24:37,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:24:37,848][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:24:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:24:38,492][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:24:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:24:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:24:39,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:24:39,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:24:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:24:40,442][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:24:40,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:24:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:24:41,416][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:24:41,740][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:24:42,065][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:24:42,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:24:43,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:24:43,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:24:43,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:24:43,872][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:24:45,448][__main__][INFO] - Iteration 374 took 24s (39.42% Gen, 54.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 0m 54s. Estimated total time: 20h 21m 51s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 43s, 500 more iterations: 3h 23m 38s. [2025-11-13 10:24:45,450][__main__][INFO] - Starting iteration 374. [2025-11-13 10:24:45,453][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:24:45,454][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:24:54,332][__main__][INFO] - Number of regex retries in iteration 374: 0 [2025-11-13 10:24:54,333][__main__][INFO] - agents played in iteration 374 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:24:54,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:54,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:54,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:54,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:54,897][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:24:54,897][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:24:55,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:24:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:24:56,309][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:24:56,638][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:24:56,963][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:24:57,288][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:24:57,615][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:24:57,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:24:58,266][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:24:58,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:24:58,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:24:59,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:24:59,565][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:24:59,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:25:00,216][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:25:00,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:25:00,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:25:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:25:01,516][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:25:01,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:25:02,168][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:25:02,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:25:02,818][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:25:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:25:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:25:03,792][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:25:04,115][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:25:04,440][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:25:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:25:05,090][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:25:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:25:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:25:06,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:25:06,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:25:07,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:25:07,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:25:07,539][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:25:08,531][__main__][INFO] - Iteration 375 took 23s (38.47% Gen, 57.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 52m 35s. Estimated total time: 19h 13m 55s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 19s. [2025-11-13 10:25:08,533][__main__][INFO] - Starting iteration 375. [2025-11-13 10:25:08,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:25:08,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:25:17,662][__main__][INFO] - Number of regex retries in iteration 375: 0 [2025-11-13 10:25:17,663][__main__][INFO] - agents played in iteration 375 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:25:18,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:18,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:18,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:18,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:18,224][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:25:18,225][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:25:19,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:25:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:25:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:25:19,954][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:25:20,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:25:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:25:20,939][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:25:21,267][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:25:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:25:21,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:25:22,249][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:25:22,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:25:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:25:23,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:25:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:25:23,874][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:25:24,201][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:25:24,527][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:25:24,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:25:25,180][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:25:25,505][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:25:25,829][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:25:26,154][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:25:26,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:25:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:25:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:25:27,458][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:25:27,782][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:25:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:25:28,434][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:25:28,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:25:29,083][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:25:29,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:25:30,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:25:30,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:25:30,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:25:30,885][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:25:31,932][__main__][INFO] - Iteration 376 took 23s (39.00% Gen, 56.51% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 8m 7s. Estimated total time: 19h 29m 50s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 58s. [2025-11-13 10:25:31,935][__main__][INFO] - Starting iteration 376. [2025-11-13 10:25:31,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:25:31,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:25:40,228][__main__][INFO] - Number of regex retries in iteration 376: 0 [2025-11-13 10:25:40,229][__main__][INFO] - agents played in iteration 376 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:25:40,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:40,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:40,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:40,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:40,799][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:25:40,800][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:25:41,590][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:25:41,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:25:42,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:25:42,539][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:25:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:25:43,191][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:25:43,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:25:43,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:25:44,172][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:25:44,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:25:44,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:25:45,151][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:25:45,475][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:25:45,800][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:25:46,125][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:25:46,455][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:25:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:25:47,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:25:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:25:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:25:48,083][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:25:48,408][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:25:48,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:25:49,058][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:25:49,384][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:25:49,710][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:25:50,034][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:25:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:25:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:25:51,017][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:25:51,346][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:25:51,667][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:25:51,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:25:52,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:25:53,459][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:25:53,461][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:25:53,462][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:25:54,456][__main__][INFO] - Iteration 377 took 22s (36.76% Gen, 58.77% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 23m 49s. Estimated total time: 18h 45m 55s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 31s, 500 more iterations: 3h 7m 39s. [2025-11-13 10:25:54,458][__main__][INFO] - Starting iteration 377. [2025-11-13 10:25:54,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:25:54,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:26:03,956][__main__][INFO] - Number of regex retries in iteration 377: 0 [2025-11-13 10:26:03,957][__main__][INFO] - agents played in iteration 377 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:26:04,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:04,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:04,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:04,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:04,521][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:26:04,521][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:26:05,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:26:05,585][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:26:05,913][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:26:06,248][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:26:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:26:06,909][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:26:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:26:07,563][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:26:07,891][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:26:08,216][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:26:08,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:26:08,866][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:26:09,190][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:26:09,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:26:09,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:26:10,166][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:26:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:26:10,816][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:26:11,144][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:26:11,468][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:26:11,793][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:26:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:26:12,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:26:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:26:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:26:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:26:13,754][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:26:14,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:26:14,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:26:14,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:26:15,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:26:15,372][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:26:15,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:26:16,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:26:17,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:26:17,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:26:17,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:26:18,174][__main__][INFO] - Iteration 378 took 23s (40.04% Gen, 55.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 12s. Estimated total time: 19h 45m 41s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 36s. [2025-11-13 10:26:18,176][__main__][INFO] - Starting iteration 378. [2025-11-13 10:26:18,180][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:26:18,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:26:27,304][__main__][INFO] - Number of regex retries in iteration 378: 0 [2025-11-13 10:26:27,304][__main__][INFO] - agents played in iteration 378 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:26:27,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:27,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:27,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:27,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:27,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:26:27,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:26:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:26:28,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:26:29,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:26:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:26:29,908][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:26:30,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:26:30,563][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:26:30,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:26:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:26:31,540][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:26:31,865][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:26:32,195][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:26:32,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:26:32,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:26:33,168][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:26:33,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:26:33,819][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:26:34,147][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:26:34,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:26:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:26:35,123][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:26:35,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:26:35,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:26:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:26:36,424][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:26:36,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:26:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:26:37,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:26:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:26:38,047][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:26:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:26:38,698][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:26:39,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:26:39,730][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:26:40,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:26:40,489][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:26:40,491][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:26:41,469][__main__][INFO] - Iteration 379 took 23s (39.18% Gen, 56.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 38s. Estimated total time: 19h 24m 30s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 5s. [2025-11-13 10:26:41,471][__main__][INFO] - Starting iteration 379. [2025-11-13 10:26:41,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:26:41,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:26:49,605][__main__][INFO] - Number of regex retries in iteration 379: 0 [2025-11-13 10:26:49,606][__main__][INFO] - agents played in iteration 379 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:26:50,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:26:50,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:26:50,957][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:26:51,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:26:51,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:26:51,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:26:52,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:26:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:26:52,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:26:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:26:53,543][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:26:53,870][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:26:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:26:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:26:54,850][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:26:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:26:55,501][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:26:55,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:26:56,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:26:56,477][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:26:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:26:57,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:26:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:26:57,779][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:26:58,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:26:58,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:26:58,758][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:26:59,083][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:26:59,409][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:26:59,734][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:27:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:27:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:27:00,711][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:27:01,035][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:27:01,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:27:02,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:02,831][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:02,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:02,834][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:27:03,840][__main__][INFO] - Iteration 380 took 22s (36.35% Gen, 59.15% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 15m 2s. Estimated total time: 18h 38m 17s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 16s, 500 more iterations: 3h 6m 22s. [2025-11-13 10:27:03,842][__main__][INFO] - Starting iteration 380. [2025-11-13 10:27:03,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. [2025-11-13 10:27:03,845][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:27:11,579][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1 [2025-11-13 10:27:12,943][__main__][INFO] - Number of regex retries in iteration 380: 1 [2025-11-13 10:27:12,944][__main__][INFO] - agents played in iteration 380 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:27:13,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:13,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:13,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:13,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:13,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:27:13,511][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:27:14,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:27:14,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:27:14,904][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:27:15,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:27:15,562][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:27:15,888][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:27:16,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:27:16,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:27:16,867][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:27:17,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:27:17,519][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:27:17,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:27:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:27:18,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:27:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:27:19,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:27:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:27:19,800][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:27:20,125][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:27:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:27:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:27:21,105][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:27:21,430][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:27:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:27:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:27:22,407][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:27:22,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:27:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:27:23,384][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:27:23,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:27:24,034][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:27:24,363][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:27:24,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:27:25,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:26,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:26,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:26,164][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:27:28,102][__main__][INFO] - Iteration 381 took 24s (37.50% Gen, 54.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 16s. Estimated total time: 20h 12m 55s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 25s, 500 more iterations: 3h 22m 9s. [2025-11-13 10:27:28,105][__main__][INFO] - Starting iteration 381. [2025-11-13 10:27:28,109][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:27:28,109][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:27:37,397][__main__][INFO] - Number of regex retries in iteration 381: 0 [2025-11-13 10:27:37,397][__main__][INFO] - agents played in iteration 381 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:27:37,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:37,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:37,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:38,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:38,014][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:27:38,014][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:27:38,767][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:27:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:27:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:27:39,713][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:27:40,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:27:40,363][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:27:40,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:27:41,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:27:41,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:27:41,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:27:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:27:42,323][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:27:42,656][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:27:42,982][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:27:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:27:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:27:43,961][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:27:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:27:44,612][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:27:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:27:45,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:27:45,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:27:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:27:46,248][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:27:46,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:27:46,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:27:47,224][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:27:47,550][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:27:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:27:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:27:48,528][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:27:48,853][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:27:49,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:27:49,908][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:50,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:50,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:50,648][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:27:51,669][__main__][INFO] - Iteration 382 took 23s (39.42% Gen, 56.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 3s. Estimated total time: 19h 38m 5s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 20s. [2025-11-13 10:27:51,672][__main__][INFO] - Starting iteration 382. [2025-11-13 10:27:51,675][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:27:51,676][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:00,796][__main__][INFO] - Number of regex retries in iteration 382: 0 [2025-11-13 10:28:00,797][__main__][INFO] - agents played in iteration 382 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:28:01,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:01,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:01,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:01,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:01,375][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:01,376][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:28:02,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:02,772][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:03,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:03,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:04,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:04,389][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:04,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:05,039][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:05,365][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:05,690][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:28:06,015][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:28:06,341][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:28:06,666][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:28:06,994][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:28:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:28:07,645][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:28:07,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:28:08,297][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:28:08,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:28:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:28:09,275][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:28:09,601][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:28:09,927][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:28:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:28:10,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:28:10,903][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:28:11,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:28:11,555][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:28:11,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:28:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:28:12,531][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:28:13,273][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:28:14,028][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:28:14,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:28:14,031][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:28:15,330][__main__][INFO] - Iteration 383 took 23s (38.56% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 22s. Estimated total time: 19h 42m 48s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 8s. [2025-11-13 10:28:15,332][__main__][INFO] - Starting iteration 383. [2025-11-13 10:28:15,335][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:28:15,336][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:24,196][__main__][INFO] - Number of regex retries in iteration 383: 0 [2025-11-13 10:28:24,197][__main__][INFO] - agents played in iteration 383 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:28:24,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:24,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:24,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:24,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:24,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:24,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:28:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:25,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:26,151][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:26,476][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:26,801][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:27,125][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:27,450][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:27,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:28,429][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:29,089][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:28:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:28:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:28:30,069][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:28:30,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:28:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:28:31,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:28:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:28:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:28:32,029][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:28:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:28:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:28:33,005][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:28:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:28:33,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:28:33,981][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:28:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:28:34,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:28:34,956][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:28:35,280][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:28:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:28:35,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:28:36,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:28:37,427][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:28:37,431][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:28:37,433][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:28:38,535][__main__][INFO] - Iteration 384 took 23s (38.19% Gen, 57.05% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 55m 11s. Estimated total time: 19h 20m 1s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 20s. [2025-11-13 10:28:38,537][__main__][INFO] - Starting iteration 384. [2025-11-13 10:28:38,540][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:28:38,540][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:47,227][__main__][INFO] - Number of regex retries in iteration 384: 0 [2025-11-13 10:28:47,228][__main__][INFO] - agents played in iteration 384 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:28:47,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:47,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:47,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:47,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:47,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:47,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:28:48,567][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:48,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:49,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:50,818][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:51,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:51,466][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:28:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:28:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:28:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:28:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:28:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:28:54,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:28:54,400][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:28:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:28:55,051][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:28:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:28:55,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:28:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:28:56,351][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:28:56,678][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:28:57,003][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:28:57,329][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:28:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:28:57,983][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:28:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:28:58,634][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:28:58,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:28:59,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:00,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:00,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:00,437][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:01,464][__main__][INFO] - Iteration 385 took 22s (37.89% Gen, 57.62% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 41m 1s. Estimated total time: 19h 6m 14s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 2s. [2025-11-13 10:29:01,466][__main__][INFO] - Starting iteration 385. [2025-11-13 10:29:01,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:29:01,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:10,088][__main__][INFO] - Number of regex retries in iteration 385: 0 [2025-11-13 10:29:10,088][__main__][INFO] - agents played in iteration 385 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:29:10,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:10,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:29:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:11,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:12,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:12,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:12,717][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:13,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:29:13,368][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:29:13,696][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:14,348][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:14,673][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:15,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:15,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:15,977][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:16,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:16,955][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:17,283][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:17,607][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:29:17,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:29:18,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:18,585][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:18,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:19,889][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:20,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:20,539][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:20,864][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:21,190][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:21,839][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:29:22,582][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:23,291][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:23,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:23,295][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:24,264][__main__][INFO] - Iteration 386 took 22s (37.81% Gen, 57.94% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 34m 11s. Estimated total time: 18h 59m 47s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 59s, 500 more iterations: 3h 9m 57s. [2025-11-13 10:29:24,266][__main__][INFO] - Starting iteration 386. [2025-11-13 10:29:24,270][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:29:24,270][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:33,417][__main__][INFO] - Number of regex retries in iteration 386: 0 [2025-11-13 10:29:33,417][__main__][INFO] - agents played in iteration 386 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:29:33,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:33,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:33,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:33,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:33,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:33,989][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:29:34,764][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:35,058][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:36,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:29:36,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:29:37,011][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:37,335][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:37,985][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:38,312][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:38,636][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:38,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:39,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:39,939][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:40,270][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:40,606][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:29:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:29:41,581][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:41,908][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:42,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:43,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:43,538][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:43,860][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:44,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:44,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:44,835][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:45,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:29:45,883][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:46,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:46,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:46,634][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:47,681][__main__][INFO] - Iteration 387 took 23s (39.07% Gen, 56.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 4m 36s. Estimated total time: 19h 30m 35s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 5s. [2025-11-13 10:29:47,683][__main__][INFO] - Starting iteration 387. [2025-11-13 10:29:47,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:29:47,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:56,972][__main__][INFO] - Number of regex retries in iteration 387: 0 [2025-11-13 10:29:56,973][__main__][INFO] - agents played in iteration 387 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:29:57,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:57,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:57,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:57,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:57,539][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:57,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:29:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:58,934][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:59,591][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:00,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:00,564][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:00,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:01,214][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:01,540][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:01,865][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:02,200][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:30:02,527][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:30:02,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:30:03,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:30:03,519][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:30:03,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:30:04,172][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:30:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:30:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:30:05,151][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:30:05,478][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:30:05,804][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:30:06,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:30:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:30:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:30:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:30:07,436][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:30:07,761][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:30:08,089][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:30:08,413][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:30:08,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:30:09,476][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:30:10,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:30:10,208][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:30:10,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:30:11,199][__main__][INFO] - Iteration 388 took 23s (39.49% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 20s. Estimated total time: 19h 35m 42s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 57s. [2025-11-13 10:30:11,201][__main__][INFO] - Starting iteration 388. [2025-11-13 10:30:11,204][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:30:11,205][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:19,819][__main__][INFO] - Number of regex retries in iteration 388: 0 [2025-11-13 10:30:19,820][__main__][INFO] - agents played in iteration 388 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:30:20,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:20,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:20,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:20,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:20,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:20,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:30:21,177][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:21,802][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:22,128][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:22,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:22,776][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:23,104][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:23,429][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:23,755][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:24,404][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:24,731][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:25,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:30:25,387][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:30:25,713][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:30:26,036][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:30:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:30:26,688][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:30:27,014][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:30:27,340][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:30:27,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:30:27,989][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:30:28,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:30:28,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:30:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:30:29,298][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:30:29,630][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:30:29,948][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:30:30,271][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:30:30,602][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:30:30,934][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:30:31,253][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:30:31,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:30:32,348][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:30:33,064][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:30:33,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:30:33,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:30:34,016][__main__][INFO] - Iteration 389 took 22s (37.77% Gen, 58.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 33m 52s. Estimated total time: 19h 0m 37s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 1s, 500 more iterations: 3h 10m 6s. [2025-11-13 10:30:34,018][__main__][INFO] - Starting iteration 389. [2025-11-13 10:30:34,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:30:34,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:42,880][__main__][INFO] - Number of regex retries in iteration 389: 0 [2025-11-13 10:30:42,880][__main__][INFO] - agents played in iteration 389 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:30:43,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:43,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:43,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:43,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:43,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:43,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:30:44,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:44,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:45,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:45,547][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:45,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:46,192][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:46,517][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:47,820][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:48,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:48,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:30:48,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:30:49,121][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:30:49,447][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:30:49,771][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:30:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:30:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:30:50,752][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:30:51,080][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:30:51,409][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:30:51,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:30:52,060][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:30:52,384][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:30:52,710][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:30:53,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:30:53,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:30:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:30:54,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:30:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:30:54,672][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:30:55,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:30:55,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:30:56,493][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:30:56,496][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:30:56,497][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:30:57,426][__main__][INFO] - Iteration 390 took 23s (37.85% Gen, 58.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 3m 7s. Estimated total time: 19h 30m 16s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 2s. [2025-11-13 10:30:57,429][__main__][INFO] - Starting iteration 390. [2025-11-13 10:30:57,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. [2025-11-13 10:30:57,433][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:06,584][__main__][INFO] - Number of regex retries in iteration 390: 0 [2025-11-13 10:31:06,585][__main__][INFO] - agents played in iteration 390 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:31:07,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:07,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:07,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:07,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:07,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:07,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:31:07,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:31:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:31:08,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:31:08,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:31:09,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:31:09,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:31:09,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:31:10,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:31:10,539][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:31:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:31:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:31:11,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:31:11,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:31:12,162][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:31:12,488][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:31:12,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:31:13,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:31:13,467][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:31:13,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:31:14,122][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:31:14,448][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:31:14,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:31:15,103][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:31:15,427][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:31:15,754][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:31:16,080][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:31:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:31:16,733][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:31:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:31:17,382][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:31:17,707][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:31:18,032][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:31:18,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:31:19,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:19,830][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:19,832][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:19,834][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:21,754][__main__][INFO] - Iteration 391 took 24s (37.62% Gen, 54.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 35s. Estimated total time: 20h 16m 7s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 32s, 500 more iterations: 3h 22m 41s. [2025-11-13 10:31:21,756][__main__][INFO] - Starting iteration 391. [2025-11-13 10:31:21,760][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:31:21,761][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:31,085][__main__][INFO] - Number of regex retries in iteration 391: 0 [2025-11-13 10:31:31,086][__main__][INFO] - agents played in iteration 391 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:31:31,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:31,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:31,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:31,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:31,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:31,677][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:31:32,439][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:31:32,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:31:33,061][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:31:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:31:33,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:31:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:31:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:31:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:31:35,015][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:31:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:31:35,664][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:31:35,988][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:31:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:31:36,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:31:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:31:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:31:37,612][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:31:37,945][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:31:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:31:38,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:31:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:31:39,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:31:39,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:31:39,897][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:31:40,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:31:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:31:40,876][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:31:41,201][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:31:41,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:31:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:31:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:31:42,502][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:31:42,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:31:43,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:44,273][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:44,282][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:44,284][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:45,231][__main__][INFO] - Iteration 392 took 23s (39.73% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 5m 38s. Estimated total time: 19h 33m 35s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 35s. [2025-11-13 10:31:45,233][__main__][INFO] - Starting iteration 392. [2025-11-13 10:31:45,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:31:45,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:54,628][__main__][INFO] - Number of regex retries in iteration 392: 0 [2025-11-13 10:31:54,629][__main__][INFO] - agents played in iteration 392 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:31:55,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:55,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:55,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:55,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:55,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:55,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:31:55,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:31:56,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:31:56,600][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:31:56,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:31:57,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:31:57,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:31:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:31:58,224][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:31:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:31:58,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:31:59,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:31:59,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:31:59,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:32:00,182][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:32:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:32:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:32:01,159][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:32:01,486][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:32:01,816][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:32:02,146][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:32:02,473][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:32:02,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:32:03,136][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:32:03,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:32:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:32:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:32:04,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:32:04,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:32:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:32:05,409][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:32:05,734][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:32:06,058][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:32:06,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:32:07,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:07,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:07,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:07,886][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:08,881][__main__][INFO] - Iteration 393 took 23s (39.72% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 56s. Estimated total time: 19h 42m 16s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s. [2025-11-13 10:32:08,883][__main__][INFO] - Starting iteration 393. [2025-11-13 10:32:08,887][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:32:08,887][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:32:18,175][__main__][INFO] - Number of regex retries in iteration 393: 0 [2025-11-13 10:32:18,176][__main__][INFO] - agents played in iteration 393 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:32:18,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:18,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:18,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:18,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:18,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:32:18,753][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:32:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:32:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:32:20,146][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:32:20,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:32:20,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:32:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:32:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:32:21,770][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:32:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:32:22,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:32:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:32:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:32:23,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:32:23,726][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:32:24,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:32:24,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:32:24,702][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:32:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:32:25,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:32:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:32:26,006][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:32:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:32:26,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:32:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:32:27,307][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:32:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:32:27,959][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:32:28,283][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:32:28,613][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:32:28,937][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:32:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:32:29,593][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:32:29,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:32:30,635][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:31,366][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:31,367][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:31,369][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:32,385][__main__][INFO] - Iteration 394 took 23s (39.53% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 12s. Estimated total time: 19h 34m 56s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 49s. [2025-11-13 10:32:32,387][__main__][INFO] - Starting iteration 394. [2025-11-13 10:32:32,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:32:32,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:32:40,885][__main__][INFO] - Number of regex retries in iteration 394: 0 [2025-11-13 10:32:40,885][__main__][INFO] - agents played in iteration 394 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:32:41,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:41,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:41,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:41,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:41,502][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:32:41,503][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:32:42,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:32:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:32:42,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:32:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:32:43,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:32:43,858][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:32:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:32:44,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:32:44,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:32:45,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:32:45,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:32:45,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:32:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:32:46,457][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:32:46,786][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:32:47,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:32:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:32:47,767][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:32:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:32:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:32:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:32:49,068][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:32:49,394][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:32:49,719][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:32:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:32:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:32:50,695][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:32:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:32:51,344][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:32:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:32:51,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:32:52,319][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:32:52,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:32:53,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:54,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:54,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:54,122][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:55,157][__main__][INFO] - Iteration 395 took 22s (37.31% Gen, 58.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 29m 17s. Estimated total time: 18h 58m 24s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 44s. [2025-11-13 10:32:55,159][__main__][INFO] - Starting iteration 395. [2025-11-13 10:32:55,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:32:55,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:33:04,261][__main__][INFO] - Number of regex retries in iteration 395: 0 [2025-11-13 10:33:04,261][__main__][INFO] - agents played in iteration 395 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:33:04,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:04,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:04,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:04,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:04,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:33:04,839][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:33:05,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:33:05,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:33:06,252][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:33:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:33:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:33:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:33:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:33:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:33:08,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:33:08,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:33:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:33:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:33:09,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:33:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:33:10,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:33:10,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:33:10,807][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:33:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:33:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:33:11,788][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:33:12,115][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:33:12,443][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:33:12,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:33:13,095][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:33:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:33:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:33:14,075][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:33:14,400][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:33:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:33:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:33:15,380][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:33:15,705][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:33:16,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:33:16,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:33:17,520][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:33:17,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:33:17,523][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:33:18,512][__main__][INFO] - Iteration 396 took 23s (38.96% Gen, 56.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 58m 2s. Estimated total time: 19h 27m 32s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 35s. [2025-11-13 10:33:18,514][__main__][INFO] - Starting iteration 396. [2025-11-13 10:33:18,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:33:18,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:33:27,594][__main__][INFO] - Number of regex retries in iteration 396: 0 [2025-11-13 10:33:27,594][__main__][INFO] - agents played in iteration 396 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:33:28,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:28,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:28,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:28,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:28,176][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:33:28,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:33:28,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:33:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:33:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:33:29,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:33:30,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:33:30,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:33:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:33:31,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:33:31,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:33:31,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:33:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:33:32,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:33:32,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:33:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:33:33,485][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:33:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:33:34,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:33:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:33:34,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:33:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:33:35,451][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:33:35,779][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:33:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:33:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:33:36,763][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:33:37,089][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:33:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:33:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:33:38,067][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:33:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:33:38,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:33:39,042][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:33:39,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:33:40,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:33:40,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:33:40,823][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:33:40,825][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:33:41,834][__main__][INFO] - Iteration 397 took 23s (38.92% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 55m 57s. Estimated total time: 19h 25m 50s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 18s. [2025-11-13 10:33:41,836][__main__][INFO] - Starting iteration 397. [2025-11-13 10:33:41,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:33:41,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:33:50,315][__main__][INFO] - Number of regex retries in iteration 397: 0 [2025-11-13 10:33:50,316][__main__][INFO] - agents played in iteration 397 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:33:50,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:50,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:50,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:50,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:50,884][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:33:50,885][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:33:51,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:33:51,956][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:33:52,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:33:52,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:33:52,932][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:33:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:33:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:33:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:33:54,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:33:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:33:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:33:55,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:33:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:33:55,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:33:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:33:56,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:33:56,833][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:33:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:33:57,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:33:57,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:33:58,142][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:33:58,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:33:58,792][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:33:59,118][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:33:59,444][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:33:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:34:00,095][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:34:00,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:34:00,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:34:01,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:34:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:34:01,723][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:34:02,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:34:02,792][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:34:03,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:34:03,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:34:03,513][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:34:04,436][__main__][INFO] - Iteration 398 took 22s (37.50% Gen, 58.40% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 19m 38s. Estimated total time: 18h 49m 53s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 39s, 500 more iterations: 3h 8m 18s. [2025-11-13 10:34:04,438][__main__][INFO] - Starting iteration 398. [2025-11-13 10:34:04,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:34:04,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:34:13,373][__main__][INFO] - Number of regex retries in iteration 398: 0 [2025-11-13 10:34:13,374][__main__][INFO] - agents played in iteration 398 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:34:13,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:13,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:13,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:13,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:13,987][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:34:13,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:34:14,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:34:15,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:34:15,409][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:34:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:34:16,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:34:16,386][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:34:16,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:34:17,034][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:34:17,361][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:34:17,686][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:34:18,011][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:34:18,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:34:18,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:34:18,988][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:34:19,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:34:19,639][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:34:19,965][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:34:20,291][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:34:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:34:20,945][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:34:21,273][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:34:21,598][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:34:21,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:34:22,257][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:34:22,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:34:22,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:34:23,243][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:34:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:34:23,897][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:34:24,223][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:34:24,549][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:34:24,876][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:34:25,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:34:25,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:34:26,706][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:34:26,715][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:34:26,716][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:34:27,651][__main__][INFO] - Iteration 399 took 23s (38.48% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 49m 51s. Estimated total time: 19h 20m 30s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 25s. [2025-11-13 10:34:27,653][__main__][INFO] - Starting iteration 399. [2025-11-13 10:34:27,656][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:34:27,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:34:36,739][__main__][INFO] - Number of regex retries in iteration 399: 0 [2025-11-13 10:34:36,739][__main__][INFO] - agents played in iteration 399 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:34:37,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:37,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:37,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:37,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:34:37,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:34:37,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:34:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:34:38,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:34:38,744][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:34:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:34:39,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:34:39,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:34:40,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:34:40,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:34:40,695][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:34:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:34:41,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:34:41,671][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:34:41,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:34:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:34:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:34:42,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:34:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:34:43,620][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:34:43,944][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:34:44,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:34:44,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:34:44,923][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:34:45,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:34:45,576][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:34:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:34:46,230][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:34:46,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:34:46,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:34:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:34:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:34:47,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:34:48,188][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:34:48,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:34:49,292][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:34:50,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:34:50,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:34:50,020][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:34:51,003][__main__][INFO] - Iteration 400 took 23s (38.90% Gen, 56.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 23s. Estimated total time: 19h 27m 25s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 34s. [2025-11-13 10:34:51,006][__main__][INFO] - Starting iteration 400. [2025-11-13 10:34:51,010][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. [2025-11-13 10:34:51,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:35:00,129][__main__][INFO] - Number of regex retries in iteration 400: 0 [2025-11-13 10:35:00,129][__main__][INFO] - agents played in iteration 400 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:35:00,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:00,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:00,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:00,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:00,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:35:00,719][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:35:01,492][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:35:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:35:02,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:35:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:35:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:35:03,090][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:35:03,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:35:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:35:04,066][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:35:04,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:35:04,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:35:05,044][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:35:05,369][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:35:05,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:35:06,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:35:06,344][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:35:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:35:06,995][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:35:07,321][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:35:07,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:35:07,970][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:35:08,296][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:35:08,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:35:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:35:09,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:35:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:35:09,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:35:10,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:35:10,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:35:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:35:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:35:11,563][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:35:11,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:35:12,687][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:35:13,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:35:13,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:35:13,413][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:35:15,284][__main__][INFO] - Iteration 401 took 24s (37.57% Gen, 54.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 20s. Estimated total time: 20h 13m 47s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 27s, 500 more iterations: 3h 22m 17s. [2025-11-13 10:35:15,286][__main__][INFO] - Starting iteration 401. [2025-11-13 10:35:15,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:35:15,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:35:24,008][__main__][INFO] - Number of regex retries in iteration 401: 0 [2025-11-13 10:35:24,008][__main__][INFO] - agents played in iteration 401 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:35:24,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:24,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:24,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:24,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:24,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:35:24,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:35:25,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:35:25,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:35:25,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:35:26,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:35:26,639][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:35:26,965][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:35:27,292][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:35:27,617][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:35:27,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:35:28,267][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:35:28,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:35:28,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:35:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:35:29,567][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:35:29,893][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:35:30,218][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:35:30,542][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:35:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:35:31,194][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:35:31,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:35:31,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:35:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:35:32,495][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:35:32,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:35:33,146][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:35:33,471][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:35:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:35:34,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:35:34,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:35:34,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:35:35,101][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:35:35,425][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:35:35,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:35:36,518][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:35:37,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:35:37,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:35:37,251][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:35:38,233][__main__][INFO] - Iteration 402 took 22s (38.00% Gen, 57.72% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 35m 24s. Estimated total time: 19h 7m 14s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 12s. [2025-11-13 10:35:38,235][__main__][INFO] - Starting iteration 402. [2025-11-13 10:35:38,239][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:35:38,240][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:35:47,777][__main__][INFO] - Number of regex retries in iteration 402: 0 [2025-11-13 10:35:47,778][__main__][INFO] - agents played in iteration 402 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:35:48,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:48,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:48,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:48,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:35:48,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:35:48,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:35:49,123][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:35:49,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:35:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:35:50,073][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:35:50,407][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:35:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:35:51,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:35:51,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:35:51,709][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:35:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:35:52,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:35:52,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:35:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:35:53,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:35:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:35:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:35:54,303][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:35:54,620][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:35:54,945][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:35:55,270][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:35:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:35:55,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:35:56,244][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:35:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:35:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:35:57,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:35:57,546][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:35:57,870][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:35:58,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:35:58,524][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:35:58,845][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:35:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:35:59,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:36:00,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:36:00,980][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:36:00,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:36:00,983][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:36:01,966][__main__][INFO] - Iteration 403 took 23s (40.20% Gen, 55.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 10s. Estimated total time: 19h 46m 23s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 43s. [2025-11-13 10:36:01,968][__main__][INFO] - Starting iteration 403. [2025-11-13 10:36:01,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:36:01,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:36:10,532][__main__][INFO] - Number of regex retries in iteration 403: 0 [2025-11-13 10:36:10,533][__main__][INFO] - agents played in iteration 403 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:36:10,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:11,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:11,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:11,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:11,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:36:11,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:36:11,837][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:36:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:36:12,458][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:36:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:36:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:36:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:36:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:36:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:36:14,412][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:36:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:36:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:36:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:36:15,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:36:16,036][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:36:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:36:16,685][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:36:17,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:36:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:36:17,666][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:36:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:36:18,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:36:18,641][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:36:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:36:19,290][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:36:19,622][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:36:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:36:20,268][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:36:20,598][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:36:20,929][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:36:21,255][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:36:21,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:36:21,909][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:36:22,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:36:23,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:36:23,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:36:23,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:36:23,738][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:36:24,719][__main__][INFO] - Iteration 404 took 22s (37.63% Gen, 58.05% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 24m 52s. Estimated total time: 18h 57m 28s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 54s, 500 more iterations: 3h 9m 34s. [2025-11-13 10:36:24,722][__main__][INFO] - Starting iteration 404. [2025-11-13 10:36:24,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:36:24,726][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:36:33,563][__main__][INFO] - Number of regex retries in iteration 404: 0 [2025-11-13 10:36:33,564][__main__][INFO] - agents played in iteration 404 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:36:34,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:34,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:34,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:34,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:34,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:36:34,137][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:36:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:36:35,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:36:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:36:35,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:36:36,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:36:36,451][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:36:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:36:37,116][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:36:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:36:37,766][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:36:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:36:38,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:36:38,752][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:36:39,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:36:39,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:36:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:36:40,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:36:40,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:36:40,709][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:36:41,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:36:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:36:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:36:42,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:36:42,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:36:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:36:42,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:36:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:36:43,646][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:36:43,971][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:36:44,297][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:36:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:36:44,947][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:36:45,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:36:46,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:36:46,735][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:36:46,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:36:46,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:36:47,729][__main__][INFO] - Iteration 405 took 23s (38.42% Gen, 57.27% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 37m 17s. Estimated total time: 19h 10m 16s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 42s. [2025-11-13 10:36:47,732][__main__][INFO] - Starting iteration 405. [2025-11-13 10:36:47,734][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:36:47,735][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:36:57,019][__main__][INFO] - Number of regex retries in iteration 405: 0 [2025-11-13 10:36:57,020][__main__][INFO] - agents played in iteration 405 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:36:57,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:57,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:57,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:57,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:57,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:36:57,577][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:36:58,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:36:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:36:58,942][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:36:59,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:36:59,593][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:36:59,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:37:00,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:37:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:37:00,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:37:01,243][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:37:01,564][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:37:01,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:37:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:37:02,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:37:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:37:03,188][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:37:03,513][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:37:03,838][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:37:04,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:37:04,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:37:04,811][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:37:05,141][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:37:05,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:37:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:37:06,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:37:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:37:06,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:37:07,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:37:07,419][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:37:07,751][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:37:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:37:08,401][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:37:08,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:37:09,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:37:10,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:37:10,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:37:10,225][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:37:11,291][__main__][INFO] - Iteration 406 took 23s (39.41% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 4m 29s. Estimated total time: 19h 37m 51s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 18s. [2025-11-13 10:37:11,293][__main__][INFO] - Starting iteration 406. [2025-11-13 10:37:11,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:37:11,296][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:37:20,414][__main__][INFO] - Number of regex retries in iteration 406: 0 [2025-11-13 10:37:20,415][__main__][INFO] - agents played in iteration 406 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:37:20,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:20,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:20,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:20,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:20,978][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:37:20,978][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:37:21,721][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:37:22,017][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:37:22,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:37:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:37:22,992][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:37:23,317][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:37:23,644][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:37:23,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:37:24,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:37:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:37:24,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:37:25,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:37:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:37:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:37:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:37:26,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:37:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:37:27,217][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:37:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:37:27,867][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:37:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:37:28,517][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:37:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:37:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:37:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:37:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:37:30,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:37:30,469][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:37:30,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:37:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:37:31,443][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:37:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:37:32,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:37:32,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:37:33,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:37:33,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:37:33,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:37:34,531][__main__][INFO] - Iteration 407 took 23s (39.24% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 3s. Estimated total time: 19h 21m 48s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 38s. [2025-11-13 10:37:34,533][__main__][INFO] - Starting iteration 407. [2025-11-13 10:37:34,535][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:37:34,536][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:37:43,618][__main__][INFO] - Number of regex retries in iteration 407: 0 [2025-11-13 10:37:43,618][__main__][INFO] - agents played in iteration 407 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:37:44,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:44,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:44,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:44,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:44,168][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:37:44,168][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:37:44,946][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:37:45,242][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:37:45,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:37:45,893][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:37:46,230][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:37:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:37:46,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:37:47,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:37:47,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:37:47,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:37:48,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:37:48,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:37:48,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:37:49,169][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:37:49,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:37:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:37:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:37:50,469][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:37:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:37:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:37:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:37:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:37:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:37:52,425][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:37:52,753][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:37:53,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:37:53,405][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:37:53,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:37:54,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:37:54,385][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:37:54,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:37:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:37:55,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:37:56,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:37:56,867][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:37:56,869][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:37:56,871][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:37:57,911][__main__][INFO] - Iteration 408 took 23s (38.85% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 54m 40s. Estimated total time: 19h 28m 49s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 48s. [2025-11-13 10:37:57,913][__main__][INFO] - Starting iteration 408. [2025-11-13 10:37:57,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:37:57,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:38:07,536][__main__][INFO] - Number of regex retries in iteration 408: 0 [2025-11-13 10:38:07,537][__main__][INFO] - agents played in iteration 408 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:38:07,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:08,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:08,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:08,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:08,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:38:08,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:38:08,865][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:38:09,160][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:38:09,491][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:38:09,816][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:38:10,141][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:38:10,468][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:38:10,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:38:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:38:11,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:38:11,777][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:38:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:38:12,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:38:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:38:13,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:38:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:38:13,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:38:14,054][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:38:14,378][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:38:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:38:15,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:38:15,355][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:38:15,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:38:16,004][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:38:16,329][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:38:16,656][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:38:16,981][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:38:17,308][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:38:17,634][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:38:17,961][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:38:18,289][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:38:18,617][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:38:18,942][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:38:19,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:38:20,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:38:20,738][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:38:20,739][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:38:20,741][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:38:21,764][__main__][INFO] - Iteration 409 took 23s (40.34% Gen, 55.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 55s. Estimated total time: 19h 52m 28s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 44s, 500 more iterations: 3h 18m 44s. [2025-11-13 10:38:21,767][__main__][INFO] - Starting iteration 409. [2025-11-13 10:38:21,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:38:21,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:38:31,166][__main__][INFO] - Number of regex retries in iteration 409: 0 [2025-11-13 10:38:31,166][__main__][INFO] - agents played in iteration 409 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:38:31,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:31,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:31,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:31,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:31,709][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:38:31,709][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:38:32,479][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:38:32,776][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:38:33,101][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:38:33,430][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:38:33,757][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:38:34,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:38:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:38:34,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:38:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:38:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:38:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:38:36,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:38:36,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:38:36,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:38:37,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:38:37,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:38:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:38:37,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:38:38,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:38:38,644][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:38:38,969][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:38:39,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:38:39,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:38:39,945][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:38:40,274][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:38:40,599][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:38:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:38:41,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:38:41,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:38:41,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:38:42,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:38:42,549][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:38:42,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:38:43,635][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:38:44,375][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:38:44,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:38:44,378][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:38:45,328][__main__][INFO] - Iteration 410 took 23s (39.88% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 57s. Estimated total time: 19h 37m 53s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 18s. [2025-11-13 10:38:45,330][__main__][INFO] - Starting iteration 410. [2025-11-13 10:38:45,334][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. [2025-11-13 10:38:45,334][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:38:53,914][__main__][INFO] - Number of regex retries in iteration 410: 0 [2025-11-13 10:38:53,915][__main__][INFO] - agents played in iteration 410 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:38:54,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:54,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:54,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:54,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:54,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:38:54,479][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:38:55,179][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:38:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:38:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:38:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:38:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:38:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:38:57,099][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:38:57,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:38:57,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:38:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:38:58,410][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:38:58,749][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:38:59,073][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:38:59,398][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:38:59,722][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:39:00,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:39:00,383][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:39:00,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:39:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:39:01,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:39:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:39:02,012][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:39:02,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:39:02,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:39:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:39:03,316][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:39:03,642][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:39:03,968][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:39:04,293][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:39:04,617][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:39:04,942][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:39:05,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:39:05,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:39:06,368][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:39:07,110][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:39:07,112][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:39:07,113][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:39:08,904][__main__][INFO] - Iteration 411 took 23s (36.40% Gen, 55.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 3m 14s. Estimated total time: 19h 38m 34s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 25s. [2025-11-13 10:39:08,906][__main__][INFO] - Starting iteration 411. [2025-11-13 10:39:08,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:39:08,910][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:39:18,113][__main__][INFO] - Number of regex retries in iteration 411: 0 [2025-11-13 10:39:18,114][__main__][INFO] - agents played in iteration 411 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:39:18,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:18,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:18,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:18,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:18,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:39:18,692][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:39:19,414][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:39:19,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:39:20,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:39:20,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:39:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:39:21,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:39:21,344][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:39:21,671][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:39:21,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:39:22,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:39:22,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:39:22,980][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:39:23,307][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:39:23,633][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:39:23,957][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:39:24,282][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:39:24,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:39:24,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:39:25,256][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:39:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:39:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:39:26,230][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:39:26,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:39:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:39:27,209][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:39:27,533][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:39:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:39:28,184][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:39:28,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:39:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:39:29,167][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:39:29,486][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:39:29,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:39:30,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:39:31,325][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:39:31,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:39:31,328][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:39:32,312][__main__][INFO] - Iteration 412 took 23s (39.32% Gen, 56.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 54m 28s. Estimated total time: 19h 30m 11s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 1s. [2025-11-13 10:39:32,314][__main__][INFO] - Starting iteration 412. [2025-11-13 10:39:32,317][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:39:32,318][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:39:41,647][__main__][INFO] - Number of regex retries in iteration 412: 0 [2025-11-13 10:39:41,648][__main__][INFO] - agents played in iteration 412 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:39:42,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:42,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:42,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:42,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:39:42,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:39:42,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:39:42,906][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:39:43,201][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:39:43,525][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:39:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:39:44,174][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:39:44,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:39:44,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:39:45,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:39:45,481][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:39:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:39:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:39:46,465][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:39:46,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:39:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:39:47,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:39:47,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:39:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:39:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:39:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:39:49,071][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:39:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:39:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:39:50,045][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:39:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:39:50,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:39:51,019][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:39:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:39:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:39:52,000][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:39:52,326][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:39:52,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:39:52,978][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:39:53,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:39:54,071][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:39:54,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:39:54,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:39:54,793][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:39:55,752][__main__][INFO] - Iteration 413 took 23s (39.81% Gen, 56.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 55m 42s. Estimated total time: 19h 31m 49s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 18s. [2025-11-13 10:39:55,754][__main__][INFO] - Starting iteration 413. [2025-11-13 10:39:55,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:39:55,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:40:04,472][__main__][INFO] - Number of regex retries in iteration 413: 0 [2025-11-13 10:40:04,472][__main__][INFO] - agents played in iteration 413 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:40:04,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:04,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:04,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:05,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:05,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:40:05,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:40:05,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:40:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:40:06,365][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:40:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:40:07,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:40:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:40:07,668][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:40:07,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:40:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:40:08,653][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:40:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:40:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:40:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:40:09,970][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:40:10,302][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:40:10,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:40:10,952][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:40:11,281][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:40:11,605][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:40:11,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:40:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:40:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:40:12,905][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:40:13,230][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:40:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:40:13,880][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:40:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:40:14,529][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:40:14,855][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:40:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:40:15,506][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:40:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:40:16,157][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:40:16,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:40:17,681][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:40:17,683][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:40:17,685][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:40:18,610][__main__][INFO] - Iteration 414 took 22s (38.13% Gen, 57.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 26m 11s. Estimated total time: 19h 2m 41s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 5s, 500 more iterations: 3h 10m 26s. [2025-11-13 10:40:18,612][__main__][INFO] - Starting iteration 414. [2025-11-13 10:40:18,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:40:18,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:40:28,384][__main__][INFO] - Number of regex retries in iteration 414: 0 [2025-11-13 10:40:28,385][__main__][INFO] - agents played in iteration 414 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:40:28,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:28,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:28,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:28,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:28,930][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:40:28,931][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:40:29,650][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:40:29,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:40:30,269][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:40:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:40:30,918][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:40:31,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:40:31,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:40:31,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:40:32,226][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:40:32,555][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:40:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:40:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:40:33,533][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:40:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:40:34,182][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:40:34,508][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:40:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:40:35,158][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:40:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:40:35,808][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:40:36,133][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:40:36,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:40:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:40:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:40:37,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:40:37,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:40:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:40:38,415][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:40:38,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:40:39,077][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:40:39,404][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:40:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:40:40,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:40:40,842][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:40:41,783][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:40:41,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:40:41,786][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:40:42,708][__main__][INFO] - Iteration 415 took 24s (40.54% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 45s. Estimated total time: 20h 4m 39s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 46s. [2025-11-13 10:40:42,710][__main__][INFO] - Starting iteration 415. [2025-11-13 10:40:42,713][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:40:42,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:40:51,440][__main__][INFO] - Number of regex retries in iteration 415: 0 [2025-11-13 10:40:51,440][__main__][INFO] - agents played in iteration 415 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:40:51,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:51,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:51,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:51,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:51,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:40:51,995][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:40:52,705][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:40:53,000][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:40:53,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:40:53,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:40:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:40:54,297][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:40:54,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:40:54,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:40:55,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:40:55,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:40:55,925][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:40:56,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:40:56,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:40:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:40:57,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:40:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:40:57,881][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:40:58,206][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:40:58,532][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:40:58,857][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:40:59,183][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:40:59,508][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:40:59,833][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:41:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:41:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:41:00,807][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:41:01,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:41:01,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:41:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:41:02,109][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:41:02,434][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:41:02,759][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:41:03,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:41:03,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:41:04,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:41:04,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:41:04,598][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:41:05,531][__main__][INFO] - Iteration 416 took 22s (38.24% Gen, 57.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 23m 39s. Estimated total time: 19h 0m 56s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 1s, 500 more iterations: 3h 10m 9s. [2025-11-13 10:41:05,533][__main__][INFO] - Starting iteration 416. [2025-11-13 10:41:05,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:41:05,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:41:15,119][__main__][INFO] - Number of regex retries in iteration 416: 0 [2025-11-13 10:41:15,120][__main__][INFO] - agents played in iteration 416 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:41:15,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:15,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:15,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:15,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:15,674][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:41:15,675][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:41:16,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:41:16,670][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:41:16,998][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:41:17,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:41:17,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:41:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:41:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:41:18,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:41:18,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:41:19,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:41:19,610][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:41:19,938][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:41:20,267][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:41:20,599][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:41:20,920][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:41:21,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:41:21,571][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:41:21,904][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:41:22,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:41:22,547][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:41:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:41:23,205][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:41:23,523][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:41:23,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:41:24,173][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:41:24,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:41:24,825][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:41:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:41:25,474][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:41:25,800][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:41:26,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:41:26,451][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:41:26,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:41:27,557][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:41:28,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:41:28,259][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:41:28,261][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:41:29,216][__main__][INFO] - Iteration 417 took 23s (40.47% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 22s. Estimated total time: 19h 44m 2s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 20s. [2025-11-13 10:41:29,218][__main__][INFO] - Starting iteration 417. [2025-11-13 10:41:29,221][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:41:29,221][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:41:38,235][__main__][INFO] - Number of regex retries in iteration 417: 0 [2025-11-13 10:41:38,236][__main__][INFO] - agents played in iteration 417 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:41:38,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:38,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:38,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:38,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:38,793][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:41:38,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:41:39,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:41:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:41:40,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:41:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:41:40,787][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:41:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:41:41,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:41:41,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:41:42,090][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:41:42,416][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:41:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:41:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:41:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:41:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:41:44,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:41:44,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:41:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:41:45,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:41:45,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:41:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:41:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:41:46,327][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:41:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:41:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:41:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:41:47,635][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:41:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:41:48,291][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:41:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:41:48,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:41:49,266][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:41:49,598][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:41:49,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:41:50,699][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:41:51,442][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:41:51,444][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:41:51,446][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:41:52,366][__main__][INFO] - Iteration 418 took 23s (38.94% Gen, 57.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 39m 16s. Estimated total time: 19h 17m 19s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 53s. [2025-11-13 10:41:52,369][__main__][INFO] - Starting iteration 418. [2025-11-13 10:41:52,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:41:52,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:42:02,077][__main__][INFO] - Number of regex retries in iteration 418: 0 [2025-11-13 10:42:02,077][__main__][INFO] - agents played in iteration 418 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:42:02,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:02,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:02,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:02,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:02,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:42:02,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:42:03,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:42:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:42:04,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:42:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:42:04,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:42:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:42:05,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:42:05,735][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:42:06,060][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:42:06,384][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:42:06,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:42:07,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:42:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:42:07,684][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:42:08,015][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:42:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:42:08,660][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:42:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:42:09,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:42:09,636][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:42:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:42:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:42:10,619][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:42:10,937][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:42:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:42:11,587][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:42:11,918][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:42:12,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:42:12,562][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:42:12,887][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:42:13,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:42:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:42:13,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:42:14,646][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:42:15,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:42:15,399][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:42:15,401][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:42:16,350][__main__][INFO] - Iteration 419 took 23s (40.47% Gen, 55.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 29s. Estimated total time: 19h 58m 56s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 49s. [2025-11-13 10:42:16,352][__main__][INFO] - Starting iteration 419. [2025-11-13 10:42:16,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:42:16,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:42:25,849][__main__][INFO] - Number of regex retries in iteration 419: 0 [2025-11-13 10:42:25,849][__main__][INFO] - agents played in iteration 419 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:42:26,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:26,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:26,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:26,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:26,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:42:26,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:42:27,116][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:42:27,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:42:27,741][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:42:28,072][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:42:28,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:42:28,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:42:29,054][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:42:29,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:42:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:42:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:42:30,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:42:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:42:31,028][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:42:31,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:42:31,682][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:42:32,013][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:42:32,339][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:42:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:42:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:42:33,317][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:42:33,641][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:42:33,966][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:42:34,297][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:42:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:42:34,943][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:42:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:42:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:42:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:42:36,243][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:42:36,569][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:42:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:42:37,218][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:42:37,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:42:38,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:42:39,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:42:39,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:42:39,044][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:42:40,006][__main__][INFO] - Iteration 420 took 23s (40.14% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 43s. Estimated total time: 19h 42m 35s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 5s. [2025-11-13 10:42:40,008][__main__][INFO] - Starting iteration 420. [2025-11-13 10:42:40,012][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. [2025-11-13 10:42:40,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:42:49,559][__main__][INFO] - Number of regex retries in iteration 420: 0 [2025-11-13 10:42:49,559][__main__][INFO] - agents played in iteration 420 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:42:50,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:50,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:50,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:50,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:50,115][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:42:50,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:42:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:42:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:42:51,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:42:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:42:52,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:42:52,416][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:42:52,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:42:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:42:53,392][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:42:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:42:54,042][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:42:54,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:42:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:42:55,024][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:42:55,349][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:42:55,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:42:55,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:42:56,325][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:42:56,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:42:56,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:42:57,303][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:42:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:42:57,956][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:42:58,281][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:42:58,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:42:58,933][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:42:59,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:42:59,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:42:59,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:43:00,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:43:00,559][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:43:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:43:01,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:43:01,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:43:02,724][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:43:02,726][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:43:02,727][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:43:04,491][__main__][INFO] - Iteration 421 took 24s (39.00% Gen, 53.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 46s. Estimated total time: 20h 24m 2s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 48s, 500 more iterations: 3h 24m 0s. [2025-11-13 10:43:04,493][__main__][INFO] - Starting iteration 421. [2025-11-13 10:43:04,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:43:04,496][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:43:14,353][__main__][INFO] - Number of regex retries in iteration 421: 0 [2025-11-13 10:43:14,353][__main__][INFO] - agents played in iteration 421 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:43:14,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:14,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:14,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:14,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:14,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:43:14,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:43:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:43:15,911][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:43:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:43:16,560][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:43:16,884][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:43:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:43:17,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:43:17,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:43:18,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:43:18,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:43:18,837][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:43:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:43:19,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:43:19,816][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:43:20,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:43:20,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:43:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:43:21,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:43:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:43:21,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:43:22,097][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:43:22,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:43:22,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:43:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:43:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:43:23,725][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:43:24,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:43:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:43:24,701][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:43:25,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:43:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:43:25,687][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:43:26,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:43:26,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:43:27,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:43:27,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:43:27,519][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:43:28,453][__main__][INFO] - Iteration 422 took 23s (41.14% Gen, 54.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 15s. Estimated total time: 19h 57m 55s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 55s, 500 more iterations: 3h 19m 39s. [2025-11-13 10:43:28,456][__main__][INFO] - Starting iteration 422. [2025-11-13 10:43:28,458][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:43:28,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:43:38,306][__main__][INFO] - Number of regex retries in iteration 422: 0 [2025-11-13 10:43:38,306][__main__][INFO] - agents played in iteration 422 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:43:38,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:38,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:38,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:38,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:43:38,889][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:43:38,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:43:39,597][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:43:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:43:40,220][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:43:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:43:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:43:41,199][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:43:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:43:41,852][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:43:42,179][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:43:42,506][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:43:42,832][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:43:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:43:43,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:43:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:43:44,140][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:43:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:43:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:43:45,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:43:45,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:43:45,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:43:46,094][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:43:46,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:43:46,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:43:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:43:47,398][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:43:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:43:48,057][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:43:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:43:48,708][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:43:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:43:49,359][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:43:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:43:50,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:43:50,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:43:51,571][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:43:51,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:43:51,574][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:43:52,488][__main__][INFO] - Iteration 423 took 24s (40.98% Gen, 55.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 26s. Estimated total time: 20h 1m 29s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 2s, 500 more iterations: 3h 20m 14s. [2025-11-13 10:43:52,490][__main__][INFO] - Starting iteration 423. [2025-11-13 10:43:52,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:43:52,493][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:44:02,312][__main__][INFO] - Number of regex retries in iteration 423: 0 [2025-11-13 10:44:02,313][__main__][INFO] - agents played in iteration 423 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:44:02,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:02,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:02,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:02,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:02,861][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:44:02,861][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:44:03,579][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:44:03,875][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:44:04,202][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:44:04,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:44:04,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:44:05,180][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:44:05,510][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:44:05,837][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:44:06,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:44:06,492][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:44:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:44:07,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:44:07,468][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:44:07,799][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:44:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:44:08,451][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:44:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:44:09,105][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:44:09,432][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:44:09,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:44:10,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:44:10,420][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:44:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:44:11,071][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:44:11,396][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:44:11,727][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:44:12,053][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:44:12,378][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:44:12,703][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:44:13,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:44:13,357][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:44:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:44:14,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:44:14,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:44:15,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:44:15,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:44:15,551][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:44:16,605][__main__][INFO] - Iteration 424 took 24s (40.72% Gen, 54.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 25m 11s. Estimated total time: 20h 5m 39s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 11s, 500 more iterations: 3h 20m 56s. [2025-11-13 10:44:16,607][__main__][INFO] - Starting iteration 424. [2025-11-13 10:44:16,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:44:16,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:44:26,210][__main__][INFO] - Number of regex retries in iteration 424: 0 [2025-11-13 10:44:26,211][__main__][INFO] - agents played in iteration 424 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:44:26,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:26,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:26,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:26,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:26,764][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:44:26,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:44:27,503][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:44:27,799][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:44:28,125][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:44:28,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:44:28,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:44:29,105][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:44:29,432][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:44:29,757][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:44:30,085][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:44:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:44:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:44:31,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:44:31,387][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:44:31,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:44:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:44:32,369][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:44:32,694][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:44:33,019][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:44:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:44:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:44:33,994][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:44:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:44:34,646][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:44:34,974][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:44:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:44:35,630][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:44:35,955][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:44:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:44:36,606][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:44:36,933][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:44:37,260][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:44:37,584][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:44:37,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:44:38,687][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:44:39,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:44:39,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:44:39,430][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:44:40,340][__main__][INFO] - Iteration 425 took 23s (40.45% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 5m 42s. Estimated total time: 19h 46m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 45s. [2025-11-13 10:44:40,342][__main__][INFO] - Starting iteration 425. [2025-11-13 10:44:40,345][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:44:40,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:44:49,301][__main__][INFO] - Number of regex retries in iteration 425: 0 [2025-11-13 10:44:49,301][__main__][INFO] - agents played in iteration 425 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:44:49,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:49,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:49,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:49,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:49,856][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:44:49,857][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:44:50,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:44:50,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:44:51,186][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:44:51,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:44:51,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:44:52,173][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:44:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:44:52,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:44:53,149][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:44:53,477][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:44:53,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:44:54,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:44:54,455][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:44:54,781][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:44:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:44:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:44:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:44:56,082][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:44:56,409][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:44:56,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:44:57,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:44:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:44:57,718][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:44:58,043][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:44:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:44:58,705][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:44:59,031][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:44:59,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:44:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:45:00,008][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:45:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:45:00,659][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:45:00,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:45:01,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:45:02,511][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:45:02,513][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:45:02,514][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:45:03,427][__main__][INFO] - Iteration 426 took 23s (38.80% Gen, 57.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 32m 53s. Estimated total time: 19h 14m 7s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 21s. [2025-11-13 10:45:03,429][__main__][INFO] - Starting iteration 426. [2025-11-13 10:45:03,432][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:45:03,432][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:45:12,784][__main__][INFO] - Number of regex retries in iteration 426: 0 [2025-11-13 10:45:12,784][__main__][INFO] - agents played in iteration 426 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:45:13,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:13,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:13,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:13,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:13,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:45:13,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:45:14,082][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:45:14,378][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:45:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:45:15,029][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:45:15,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:45:15,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:45:16,007][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:45:16,333][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:45:16,659][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:45:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:45:17,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:45:17,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:45:17,973][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:45:18,298][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:45:18,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:45:18,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:45:19,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:45:19,606][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:45:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:45:20,261][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:45:20,586][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:45:20,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:45:21,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:45:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:45:21,887][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:45:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:45:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:45:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:45:23,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:45:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:45:23,845][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:45:24,172][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:45:24,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:45:25,269][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:45:25,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:45:25,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:45:25,967][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:45:26,934][__main__][INFO] - Iteration 427 took 23s (39.79% Gen, 56.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 53m 30s. Estimated total time: 19h 35m 8s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 51s. [2025-11-13 10:45:26,936][__main__][INFO] - Starting iteration 427. [2025-11-13 10:45:26,939][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:45:26,939][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:45:35,986][__main__][INFO] - Number of regex retries in iteration 427: 0 [2025-11-13 10:45:35,987][__main__][INFO] - agents played in iteration 427 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:45:36,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:36,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:36,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:36,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:36,539][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:45:36,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:45:37,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:45:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:45:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:45:38,236][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:45:38,563][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:45:38,887][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:45:39,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:45:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:45:39,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:45:40,195][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:45:40,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:45:40,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:45:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:45:41,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:45:41,825][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:45:42,151][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:45:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:45:42,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:45:43,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:45:43,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:45:43,778][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:45:44,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:45:44,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:45:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:45:45,080][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:45:45,405][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:45:45,736][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:45:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:45:46,386][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:45:46,712][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:45:47,046][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:45:47,366][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:45:47,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:45:48,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:45:49,212][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:45:49,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:45:49,215][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:45:50,146][__main__][INFO] - Iteration 428 took 23s (38.98% Gen, 57.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 38m 22s. Estimated total time: 19h 20m 24s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 24s. [2025-11-13 10:45:50,148][__main__][INFO] - Starting iteration 428. [2025-11-13 10:45:50,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:45:50,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:45:59,889][__main__][INFO] - Number of regex retries in iteration 428: 0 [2025-11-13 10:45:59,889][__main__][INFO] - agents played in iteration 428 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:46:00,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:00,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:00,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:00,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:00,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:46:00,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:46:01,210][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:46:01,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:46:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:46:02,165][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:46:02,495][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:46:02,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:46:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:46:03,478][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:46:03,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:46:04,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:46:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:46:04,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:46:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:46:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:46:05,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:46:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:46:06,408][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:46:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:46:07,058][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:46:07,384][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:46:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:46:08,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:46:08,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:46:08,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:46:09,010][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:46:09,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:46:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:46:09,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:46:10,311][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:46:10,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:46:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:46:11,289][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:46:11,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:46:12,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:46:13,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:46:13,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:46:13,140][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:46:14,172][__main__][INFO] - Iteration 429 took 24s (40.54% Gen, 55.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 42s. Estimated total time: 20h 1m 8s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 2s, 500 more iterations: 3h 20m 11s. [2025-11-13 10:46:14,175][__main__][INFO] - Starting iteration 429. [2025-11-13 10:46:14,177][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:46:14,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:46:23,613][__main__][INFO] - Number of regex retries in iteration 429: 0 [2025-11-13 10:46:23,613][__main__][INFO] - agents played in iteration 429 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:46:24,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:24,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:24,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:24,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:24,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:46:24,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:46:24,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:46:25,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:46:25,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:46:25,911][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:46:26,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:46:26,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:46:26,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:46:27,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:46:27,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:46:27,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:46:28,193][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:46:28,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:46:28,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:46:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:46:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:46:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:46:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:46:30,469][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:46:30,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:46:31,121][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:46:31,450][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:46:31,778][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:46:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:46:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:46:32,751][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:46:33,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:46:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:46:33,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:46:34,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:46:34,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:46:34,700][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:46:35,025][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:46:35,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:46:36,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:46:36,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:46:36,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:46:36,917][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:46:37,903][__main__][INFO] - Iteration 430 took 23s (39.76% Gen, 56.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 29s. Estimated total time: 19h 46m 18s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 43s. [2025-11-13 10:46:37,905][__main__][INFO] - Starting iteration 430. [2025-11-13 10:46:37,908][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. [2025-11-13 10:46:37,908][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:46:46,944][__main__][INFO] - Number of regex retries in iteration 430: 0 [2025-11-13 10:46:46,944][__main__][INFO] - agents played in iteration 430 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:46:47,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:47,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:47,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:47,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:47,527][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:46:47,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:46:48,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:46:48,593][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:46:48,918][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:46:49,243][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:46:49,570][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:46:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:46:50,220][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:46:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:46:50,875][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:46:51,203][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:46:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:46:51,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:46:52,185][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:46:52,510][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:46:52,839][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:46:53,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:46:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:46:53,815][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:46:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:46:54,466][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:46:54,791][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:46:55,121][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:46:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:46:55,771][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:46:56,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:46:56,423][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:46:56,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:46:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:46:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:46:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:46:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:46:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:46:58,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:46:59,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:47:00,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:47:00,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:47:00,225][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:47:02,185][__main__][INFO] - Iteration 431 took 24s (37.21% Gen, 54.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 42s. Estimated total time: 20h 13m 56s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 27s, 500 more iterations: 3h 22m 19s. [2025-11-13 10:47:02,187][__main__][INFO] - Starting iteration 431. [2025-11-13 10:47:02,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:47:02,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:47:11,813][__main__][INFO] - Number of regex retries in iteration 431: 0 [2025-11-13 10:47:11,814][__main__][INFO] - agents played in iteration 431 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:47:12,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:12,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:12,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:12,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:12,385][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:47:12,385][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:47:13,134][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:47:13,429][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:47:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:47:14,079][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:47:14,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:47:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:47:15,061][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:47:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:47:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:47:16,045][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:47:16,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:47:16,696][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:47:17,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:47:17,347][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:47:17,674][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:47:17,999][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:47:18,324][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:47:18,649][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:47:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:47:19,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:47:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:47:19,948][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:47:20,272][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:47:20,601][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:47:20,926][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:47:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:47:21,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:47:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:47:22,226][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:47:22,552][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:47:22,879][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:47:23,204][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:47:23,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:47:24,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:47:25,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:47:25,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:47:25,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:47:26,156][__main__][INFO] - Iteration 432 took 23s (40.15% Gen, 55.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 42s. Estimated total time: 19h 58m 20s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 43s. [2025-11-13 10:47:26,158][__main__][INFO] - Starting iteration 432. [2025-11-13 10:47:26,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:47:26,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:47:35,681][__main__][INFO] - Number of regex retries in iteration 432: 0 [2025-11-13 10:47:35,682][__main__][INFO] - agents played in iteration 432 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:47:36,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:36,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:36,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:36,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:36,282][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:47:36,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:47:37,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:47:37,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:47:37,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:47:37,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:47:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:47:38,650][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:47:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:47:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:47:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:47:39,960][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:47:40,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:47:40,610][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:47:40,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:47:41,261][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:47:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:47:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:47:42,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:47:42,560][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:47:42,891][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:47:43,219][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:47:43,544][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:47:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:47:44,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:47:44,518][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:47:44,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:47:45,170][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:47:45,496][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:47:45,821][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:47:46,146][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:47:46,471][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:47:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:47:47,121][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:47:47,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:47:48,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:47:48,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:47:48,933][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:47:48,934][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:47:49,988][__main__][INFO] - Iteration 433 took 23s (39.95% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 7m 22s. Estimated total time: 19h 51m 23s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 33s. [2025-11-13 10:47:49,990][__main__][INFO] - Starting iteration 433. [2025-11-13 10:47:49,993][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:47:49,993][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:47:59,333][__main__][INFO] - Number of regex retries in iteration 433: 0 [2025-11-13 10:47:59,334][__main__][INFO] - agents played in iteration 433 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:47:59,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:59,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:59,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:59,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:47:59,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:47:59,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:48:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:48:00,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:48:01,268][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:48:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:48:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:48:02,241][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:48:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:48:02,895][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:48:03,223][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:48:03,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:48:03,873][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:48:04,198][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:48:04,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:48:04,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:48:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:48:05,501][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:48:05,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:48:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:48:06,482][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:48:06,802][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:48:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:48:07,454][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:48:07,787][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:48:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:48:08,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:48:08,752][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:48:09,084][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:48:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:48:09,727][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:48:10,052][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:48:10,383][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:48:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:48:11,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:48:11,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:48:12,517][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:48:12,519][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:48:12,521][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:48:13,572][__main__][INFO] - Iteration 434 took 23s (39.61% Gen, 55.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 54m 34s. Estimated total time: 19h 38m 59s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 29s. [2025-11-13 10:48:13,574][__main__][INFO] - Starting iteration 434. [2025-11-13 10:48:13,577][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:48:13,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:48:22,807][__main__][INFO] - Number of regex retries in iteration 434: 0 [2025-11-13 10:48:22,807][__main__][INFO] - agents played in iteration 434 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:48:23,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:23,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:23,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:23,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:23,726][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:48:23,726][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:48:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:48:24,754][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:48:25,084][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:48:25,409][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:48:25,735][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:48:26,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:48:26,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:48:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:48:27,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:48:27,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:48:27,702][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:48:28,027][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:48:28,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:48:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:48:29,002][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:48:29,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:48:29,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:48:29,988][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:48:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:48:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:48:30,957][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:48:31,282][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:48:31,607][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:48:31,934][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:48:32,260][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:48:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:48:32,910][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:48:33,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:48:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:48:33,884][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:48:34,212][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:48:34,539][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:48:34,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:48:35,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:48:36,361][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:48:36,362][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:48:36,364][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:48:37,402][__main__][INFO] - Iteration 435 took 23s (38.74% Gen, 56.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 28s. Estimated total time: 19h 51m 17s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 32s. [2025-11-13 10:48:37,404][__main__][INFO] - Starting iteration 435. [2025-11-13 10:48:37,407][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:48:37,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:48:46,293][__main__][INFO] - Number of regex retries in iteration 435: 0 [2025-11-13 10:48:46,294][__main__][INFO] - agents played in iteration 435 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:48:46,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:46,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:46,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:46,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:48:46,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:48:46,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:48:47,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:48:47,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:48:48,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:48:48,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:48:48,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:48:49,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:48:49,517][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:48:49,841][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:48:50,166][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:48:50,494][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:48:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:48:51,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:48:51,470][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:48:51,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:48:52,120][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:48:52,446][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:48:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:48:53,096][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:48:53,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:48:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:48:54,069][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:48:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:48:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:48:55,048][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:48:55,373][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:48:55,698][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:48:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:48:56,347][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:48:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:48:56,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:48:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:48:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:48:57,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:48:58,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:48:59,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:48:59,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:48:59,477][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:49:00,482][__main__][INFO] - Iteration 436 took 23s (38.51% Gen, 57.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 28m 36s. Estimated total time: 19h 13m 47s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 17s. [2025-11-13 10:49:00,484][__main__][INFO] - Starting iteration 436. [2025-11-13 10:49:00,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:49:00,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:49:09,245][__main__][INFO] - Number of regex retries in iteration 436: 0 [2025-11-13 10:49:09,245][__main__][INFO] - agents played in iteration 436 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:49:09,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:10,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:10,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:10,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:10,135][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:49:10,135][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:49:10,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:49:11,176][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:49:11,501][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:49:11,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:49:12,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:49:12,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:49:12,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:49:13,132][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:49:13,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:49:13,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:49:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:49:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:49:14,776][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:49:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:49:15,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:49:15,752][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:49:16,079][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:49:16,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:49:16,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:49:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:49:17,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:49:17,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:49:18,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:49:18,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:49:18,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:49:19,005][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:49:19,330][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:49:19,655][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:49:19,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:49:20,305][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:49:20,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:49:20,955][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:49:21,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:49:22,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:49:22,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:49:22,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:49:22,784][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:49:23,819][__main__][INFO] - Iteration 437 took 23s (37.53% Gen, 58.03% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 41m 3s. Estimated total time: 19h 26m 38s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 26s. [2025-11-13 10:49:23,821][__main__][INFO] - Starting iteration 437. [2025-11-13 10:49:23,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:49:23,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:49:33,152][__main__][INFO] - Number of regex retries in iteration 437: 0 [2025-11-13 10:49:33,152][__main__][INFO] - agents played in iteration 437 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:49:33,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:33,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:33,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:33,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:33,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:49:33,699][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:49:34,449][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:49:34,745][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:49:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:49:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:49:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:49:36,054][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:49:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:49:36,712][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:49:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:49:37,362][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:49:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:49:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:49:38,342][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:49:38,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:49:38,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:49:39,318][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:49:39,645][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:49:39,969][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:49:40,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:49:40,626][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:49:40,945][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:49:41,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:49:41,596][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:49:41,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:49:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:49:42,571][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:49:42,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:49:43,220][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:49:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:49:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:49:44,195][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:49:44,528][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:49:44,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:49:45,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:49:46,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:49:46,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:49:46,348][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:49:47,368][__main__][INFO] - Iteration 438 took 23s (39.62% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 51m 15s. Estimated total time: 19h 37m 13s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 12s. [2025-11-13 10:49:47,370][__main__][INFO] - Starting iteration 438. [2025-11-13 10:49:47,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:49:47,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:49:56,712][__main__][INFO] - Number of regex retries in iteration 438: 0 [2025-11-13 10:49:56,712][__main__][INFO] - agents played in iteration 438 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:49:57,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:57,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:57,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:57,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:49:57,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:49:57,256][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:49:58,026][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:49:58,323][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:49:58,649][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:49:58,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:49:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:49:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:49:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:50:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:50:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:50:00,928][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:50:01,253][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:50:01,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:50:01,905][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:50:02,234][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:50:02,559][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:50:02,885][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:50:03,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:50:03,535][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:50:03,859][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:50:04,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:50:04,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:50:04,832][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:50:05,157][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:50:05,482][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:50:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:50:06,133][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:50:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:50:06,782][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:50:07,106][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:50:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:50:07,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:50:08,084][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:50:08,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:50:09,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:50:09,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:50:09,891][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:50:09,893][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:50:10,878][__main__][INFO] - Iteration 439 took 23s (39.73% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 52s. Estimated total time: 19h 35m 14s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 52s. [2025-11-13 10:50:10,880][__main__][INFO] - Starting iteration 439. [2025-11-13 10:50:10,884][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:50:10,884][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:50:19,969][__main__][INFO] - Number of regex retries in iteration 439: 0 [2025-11-13 10:50:19,969][__main__][INFO] - agents played in iteration 439 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:50:20,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:20,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:20,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:20,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:20,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:50:20,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:50:21,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:50:21,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:50:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:50:22,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:50:22,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:50:22,861][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:50:23,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:50:23,512][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:50:23,843][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:50:24,168][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:50:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:50:24,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:50:25,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:50:25,479][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:50:25,811][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:50:26,140][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:50:26,466][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:50:26,792][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:50:27,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:50:27,447][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:50:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:50:28,096][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:50:28,428][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:50:28,744][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:50:29,069][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:50:29,393][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:50:29,722][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:50:30,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:50:30,366][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:50:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:50:31,016][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:50:31,340][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:50:31,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:50:32,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:50:33,163][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:50:33,164][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:50:33,168][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:50:34,283][__main__][INFO] - Iteration 440 took 23s (38.82% Gen, 56.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 43m 15s. Estimated total time: 19h 30m 0s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 0s. [2025-11-13 10:50:34,285][__main__][INFO] - Starting iteration 440. [2025-11-13 10:50:34,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. [2025-11-13 10:50:34,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:50:43,874][__main__][INFO] - Number of regex retries in iteration 440: 0 [2025-11-13 10:50:43,874][__main__][INFO] - agents played in iteration 440 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:50:44,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:44,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:44,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:44,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:50:44,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:50:44,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:50:45,175][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:50:45,472][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:50:45,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:50:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:50:46,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:50:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:50:47,106][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:50:47,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:50:47,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:50:48,085][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:50:48,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:50:48,734][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:50:49,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:50:49,393][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:50:49,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:50:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:50:50,368][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:50:50,693][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:50:51,018][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:50:51,343][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:50:51,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:50:51,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:50:52,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:50:52,642][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:50:52,966][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:50:53,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:50:53,617][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:50:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:50:54,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:50:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:50:54,917][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:50:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:50:55,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:50:56,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:50:57,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:50:57,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:50:57,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:50:59,051][__main__][INFO] - Iteration 441 took 24s (38.70% Gen, 53.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 58s. Estimated total time: 20h 38m 8s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 16s, 500 more iterations: 3h 26m 21s. [2025-11-13 10:50:59,053][__main__][INFO] - Starting iteration 441. [2025-11-13 10:50:59,057][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:50:59,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:51:07,824][__main__][INFO] - Number of regex retries in iteration 441: 0 [2025-11-13 10:51:07,824][__main__][INFO] - agents played in iteration 441 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:51:08,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:08,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:08,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:08,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:08,380][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:51:08,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:51:09,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:51:09,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:51:09,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:51:10,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:51:10,392][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:51:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:51:11,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:51:11,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:51:11,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:51:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:51:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:51:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:51:13,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:51:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:51:13,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:51:13,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:51:14,304][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:51:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:51:14,954][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:51:15,278][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:51:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:51:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:51:16,258][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:51:16,583][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:51:16,908][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:51:17,233][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:51:17,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:51:17,883][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:51:18,208][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:51:18,534][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:51:18,859][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:51:19,183][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:51:19,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:51:20,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:51:20,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:51:20,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:51:21,000][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:51:21,974][__main__][INFO] - Iteration 442 took 22s (38.25% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 18m 18s. Estimated total time: 19h 5m 51s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 58s. [2025-11-13 10:51:21,976][__main__][INFO] - Starting iteration 442. [2025-11-13 10:51:21,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:51:21,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:51:30,873][__main__][INFO] - Number of regex retries in iteration 442: 0 [2025-11-13 10:51:30,874][__main__][INFO] - agents played in iteration 442 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:51:31,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:31,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:31,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:31,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:31,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:51:31,446][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:51:32,195][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:51:32,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:51:32,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:51:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:51:33,476][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:51:33,800][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:51:34,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:51:34,460][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:51:34,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:51:35,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:51:35,441][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:51:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:51:36,092][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:51:36,416][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:51:36,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:51:37,071][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:51:37,395][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:51:37,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:51:38,047][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:51:38,372][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:51:38,698][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:51:39,023][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:51:39,349][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:51:39,673][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:51:39,997][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:51:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:51:40,646][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:51:40,971][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:51:41,296][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:51:41,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:51:41,944][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:51:42,269][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:51:42,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:51:43,347][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:51:44,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:51:44,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:51:44,061][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:51:45,070][__main__][INFO] - Iteration 443 took 23s (38.52% Gen, 57.11% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 26m 40s. Estimated total time: 19h 14m 36s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 26s. [2025-11-13 10:51:45,072][__main__][INFO] - Starting iteration 443. [2025-11-13 10:51:45,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:51:45,076][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:51:54,477][__main__][INFO] - Number of regex retries in iteration 443: 0 [2025-11-13 10:51:54,477][__main__][INFO] - agents played in iteration 443 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:51:54,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:54,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:55,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:55,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:55,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:51:55,043][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:51:55,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:51:56,074][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:51:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:51:56,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:51:57,051][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:51:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:51:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:51:58,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:51:58,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:51:58,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:51:59,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:51:59,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:51:59,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:51:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:52:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:52:00,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:52:00,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:52:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:52:01,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:52:01,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:52:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:52:02,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:52:02,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:52:03,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:52:03,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:52:03,890][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:52:04,214][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:52:04,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:52:04,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:52:05,187][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:52:05,511][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:52:05,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:52:06,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:52:06,948][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:52:07,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:52:07,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:52:07,666][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:52:08,699][__main__][INFO] - Iteration 444 took 23s (39.79% Gen, 55.83% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 52m 54s. Estimated total time: 19h 41m 14s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 52s. [2025-11-13 10:52:08,701][__main__][INFO] - Starting iteration 444. [2025-11-13 10:52:08,705][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:52:08,705][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:52:17,772][__main__][INFO] - Number of regex retries in iteration 444: 0 [2025-11-13 10:52:17,773][__main__][INFO] - agents played in iteration 444 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:52:18,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:18,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:18,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:18,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:18,329][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:52:18,330][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:52:19,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:52:19,357][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:52:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:52:20,009][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:52:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:52:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:52:20,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:52:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:52:21,635][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:52:21,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:52:22,285][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:52:22,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:52:22,935][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:52:23,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:52:23,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:52:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:52:24,235][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:52:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:52:24,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:52:25,219][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:52:25,544][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:52:25,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:52:26,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:52:26,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:52:26,842][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:52:27,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:52:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:52:27,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:52:28,141][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:52:28,464][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:52:28,788][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:52:29,113][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:52:29,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:52:30,189][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:52:30,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:52:30,900][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:52:30,902][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:52:31,866][__main__][INFO] - Iteration 445 took 23s (39.15% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 29m 24s. Estimated total time: 19h 18m 8s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 1s. [2025-11-13 10:52:31,868][__main__][INFO] - Starting iteration 445. [2025-11-13 10:52:31,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:52:31,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:52:40,617][__main__][INFO] - Number of regex retries in iteration 445: 0 [2025-11-13 10:52:40,618][__main__][INFO] - agents played in iteration 445 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:52:41,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:41,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:41,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:41,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:41,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:52:41,192][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:52:41,944][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:52:42,239][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:52:42,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:52:42,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:52:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:52:43,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:52:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:52:44,192][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:52:44,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:52:44,840][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:52:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:52:45,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:52:45,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:52:46,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:52:46,473][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:52:46,800][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:52:47,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:52:47,452][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:52:47,780][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:52:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:52:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:52:48,762][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:52:49,081][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:52:49,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:52:49,731][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:52:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:52:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:52:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:52:51,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:52:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:52:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:52:52,003][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:52:52,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:52:53,117][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:52:53,831][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:52:53,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:52:53,835][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:52:54,766][__main__][INFO] - Iteration 446 took 22s (38.20% Gen, 57.73% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 15m 39s. Estimated total time: 19h 4m 45s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 47s. [2025-11-13 10:52:54,768][__main__][INFO] - Starting iteration 446. [2025-11-13 10:52:54,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:52:54,771][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:04,108][__main__][INFO] - Number of regex retries in iteration 446: 0 [2025-11-13 10:53:04,109][__main__][INFO] - agents played in iteration 446 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:53:04,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:04,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:04,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:04,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:04,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:04,685][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:53:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:05,720][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:06,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:06,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:06,700][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:07,027][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:07,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:08,004][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:08,658][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:09,308][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:09,960][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:10,610][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:10,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:11,260][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:11,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:53:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:12,575][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:12,891][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:13,541][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:13,875][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:14,191][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:14,840][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:15,169][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:15,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:53:16,587][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:17,293][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:17,295][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:17,296][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:18,271][__main__][INFO] - Iteration 447 took 23s (39.73% Gen, 56.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 45m 33s. Estimated total time: 19h 35m 3s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 50s. [2025-11-13 10:53:18,273][__main__][INFO] - Starting iteration 447. [2025-11-13 10:53:18,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:53:18,278][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:26,379][__main__][INFO] - Number of regex retries in iteration 447: 0 [2025-11-13 10:53:26,380][__main__][INFO] - agents played in iteration 447 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:53:26,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:26,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:26,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:26,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:26,939][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:26,939][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:53:27,704][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:28,658][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:28,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:29,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:29,634][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:30,283][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:30,608][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:30,932][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:31,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:31,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:32,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:32,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:32,893][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:33,217][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:33,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:33,875][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:53:34,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:34,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:35,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:36,160][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:36,492][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:37,135][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:37,459][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:38,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:53:38,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:39,588][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:39,590][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:39,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:40,528][__main__][INFO] - Iteration 448 took 22s (36.41% Gen, 59.37% Train). Generation: 8s, Training: 13s. Estimated remaining time: 15h 42m 45s. Estimated total time: 18h 32m 36s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 5s, 500 more iterations: 3h 5m 26s. [2025-11-13 10:53:40,530][__main__][INFO] - Starting iteration 448. [2025-11-13 10:53:40,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:53:40,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:49,894][__main__][INFO] - Number of regex retries in iteration 448: 0 [2025-11-13 10:53:49,895][__main__][INFO] - agents played in iteration 448 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:53:50,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:50,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:50,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:50,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:50,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:50,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:53:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:51,510][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:51,836][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:52,813][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:53,788][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:54,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:55,421][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:55,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:56,727][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:57,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:57,391][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:57,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:53:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:58,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:58,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:59,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:59,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:59,997][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:00,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:01,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:54:02,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:03,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:03,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:03,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:04,071][__main__][INFO] - Iteration 449 took 23s (39.77% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 46m 38s. Estimated total time: 19h 36m 53s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 13s, 500 more iterations: 3h 16m 8s. [2025-11-13 10:54:04,078][__main__][INFO] - Starting iteration 449. [2025-11-13 10:54:04,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:54:04,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:13,201][__main__][INFO] - Number of regex retries in iteration 449: 0 [2025-11-13 10:54:13,202][__main__][INFO] - agents played in iteration 449 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:54:13,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:13,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:13,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:13,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:13,776][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:13,777][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:54:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:14,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:15,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:15,454][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:15,786][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:16,112][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:16,437][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:17,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:17,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:17,749][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:18,076][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:18,404][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:18,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:19,062][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:19,388][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:19,712][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:20,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:20,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:21,013][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:54:21,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:21,663][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:21,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:22,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:22,641][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:23,298][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:23,950][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:24,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:24,600][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:24,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:54:25,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:26,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:26,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:26,438][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:27,356][__main__][INFO] - Iteration 450 took 23s (39.17% Gen, 56.87% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 33m 3s. Estimated total time: 19h 23m 42s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 57s. [2025-11-13 10:54:27,358][__main__][INFO] - Starting iteration 450. [2025-11-13 10:54:27,361][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. [2025-11-13 10:54:27,361][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:35,861][__main__][INFO] - Number of regex retries in iteration 450: 0 [2025-11-13 10:54:35,861][__main__][INFO] - agents played in iteration 450 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:54:36,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:36,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:36,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:36,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:36,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:36,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:54:37,180][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:37,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:37,800][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:38,126][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:38,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:39,118][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:39,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:39,774][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:40,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:41,088][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:41,744][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:42,401][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:42,727][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:43,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:54:44,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:44,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:44,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:45,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:45,671][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:46,326][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:46,650][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:47,315][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:47,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:54:48,409][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:49,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:49,136][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:49,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:50,930][__main__][INFO] - Iteration 451 took 23s (36.06% Gen, 56.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 47m 29s. Estimated total time: 19h 38m 31s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 25s. [2025-11-13 10:54:50,932][__main__][INFO] - Starting iteration 451. [2025-11-13 10:54:50,935][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:54:50,935][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:59,500][__main__][INFO] - Number of regex retries in iteration 451: 0 [2025-11-13 10:54:59,501][__main__][INFO] - agents played in iteration 451 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:54:59,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:59,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:00,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:00,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:00,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:55:00,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:55:00,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:55:01,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:55:01,427][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:55:01,752][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:55:02,090][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:55:02,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:55:02,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:55:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:55:03,395][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:55:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:55:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:55:04,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:55:04,699][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:55:05,024][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:55:05,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:55:05,673][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:55:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:55:06,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:55:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:55:06,974][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:55:07,299][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:55:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:55:07,947][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:55:08,270][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:55:08,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:55:08,923][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:55:09,250][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:55:09,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:55:09,904][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:55:10,228][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:55:10,555][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:55:10,880][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:55:11,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:55:11,971][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:55:12,708][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:55:12,709][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:55:12,711][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:55:13,646][__main__][INFO] - Iteration 452 took 22s (37.71% Gen, 58.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 4m 11s. Estimated total time: 18h 55m 36s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 51s, 500 more iterations: 3h 9m 16s. [2025-11-13 10:55:13,648][__main__][INFO] - Starting iteration 452. [2025-11-13 10:55:13,651][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:55:13,651][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:55:22,919][__main__][INFO] - Number of regex retries in iteration 452: 0 [2025-11-13 10:55:22,920][__main__][INFO] - agents played in iteration 452 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:55:23,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:23,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:23,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:23,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:23,509][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:55:23,510][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:55:24,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:55:24,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:55:25,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:55:25,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:55:25,925][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:55:26,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:55:26,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:55:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:55:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:55:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:55:27,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:55:28,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:55:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:55:28,879][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:55:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:55:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:55:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:55:30,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:55:30,524][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:55:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:55:31,174][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:55:31,501][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:55:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:55:32,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:55:32,478][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:55:32,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:55:33,126][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:55:33,457][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:55:33,782][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:55:34,107][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:55:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:55:34,761][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:55:35,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:55:35,865][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:55:36,600][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:55:36,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:55:36,604][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:55:37,559][__main__][INFO] - Iteration 453 took 23s (38.77% Gen, 57.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 37s. Estimated total time: 19h 55m 26s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 14s. [2025-11-13 10:55:37,561][__main__][INFO] - Starting iteration 453. [2025-11-13 10:55:37,565][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:55:37,565][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:55:48,187][__main__][INFO] - Number of regex retries in iteration 453: 0 [2025-11-13 10:55:48,187][__main__][INFO] - agents played in iteration 453 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:55:48,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:48,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:48,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:48,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:48,761][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:55:48,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:55:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:55:49,798][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:55:50,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:55:50,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:55:50,773][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:55:51,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:55:51,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:55:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:55:52,073][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:55:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:55:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:55:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:55:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:55:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:55:54,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:55:54,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:55:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:55:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:55:55,356][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:55:55,682][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:55:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:55:56,329][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:55:56,654][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:55:56,982][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:55:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:55:57,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:55:57,976][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:55:58,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:55:58,628][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:55:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:55:59,279][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:55:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:55:59,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:56:00,704][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:56:01,437][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:56:01,439][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:56:01,440][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:56:02,363][__main__][INFO] - Iteration 454 took 24s (42.83% Gen, 53.44% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 47m 45s. Estimated total time: 20h 39m 59s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 19s, 500 more iterations: 3h 26m 39s. [2025-11-13 10:56:02,365][__main__][INFO] - Starting iteration 454. [2025-11-13 10:56:02,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:56:02,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:56:12,308][__main__][INFO] - Number of regex retries in iteration 454: 0 [2025-11-13 10:56:12,308][__main__][INFO] - agents played in iteration 454 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:56:12,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:12,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:12,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:12,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:12,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:56:12,899][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:56:13,631][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:56:13,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:56:14,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:56:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:56:14,898][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:56:15,223][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:56:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:56:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:56:16,196][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:56:16,523][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:56:16,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:56:17,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:56:17,502][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:56:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:56:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:56:18,481][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:56:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:56:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:56:19,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:56:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:56:20,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:56:20,439][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:56:20,766][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:56:21,093][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:56:21,418][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:56:21,742][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:56:22,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:56:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:56:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:56:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:56:23,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:56:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:56:24,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:56:24,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:56:25,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:56:25,508][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:56:25,510][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:56:26,459][__main__][INFO] - Iteration 455 took 24s (41.26% Gen, 54.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 57s. Estimated total time: 20h 4m 35s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 45s. [2025-11-13 10:56:26,461][__main__][INFO] - Starting iteration 455. [2025-11-13 10:56:26,464][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:56:26,465][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:56:36,741][__main__][INFO] - Number of regex retries in iteration 455: 0 [2025-11-13 10:56:36,742][__main__][INFO] - agents played in iteration 455 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:56:37,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:37,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:37,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:37,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:56:37,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:56:37,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:56:38,038][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:56:38,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:56:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:56:38,982][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:56:39,317][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:56:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:56:39,969][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:56:40,293][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:56:40,626][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:56:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:56:41,281][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:56:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:56:41,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:56:42,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:56:42,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:56:42,908][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:56:43,233][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:56:43,559][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:56:43,885][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:56:44,209][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:56:44,534][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:56:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:56:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:56:45,507][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:56:45,833][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:56:46,161][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:56:46,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:56:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:56:47,143][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:56:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:56:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:56:48,136][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:56:48,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:56:49,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:56:49,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:56:49,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:56:49,953][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:56:50,881][__main__][INFO] - Iteration 456 took 24s (42.09% Gen, 54.11% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 27m 50s. Estimated total time: 20h 20m 52s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 41s, 500 more iterations: 3h 23m 28s. [2025-11-13 10:56:50,883][__main__][INFO] - Starting iteration 456. [2025-11-13 10:56:50,886][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:56:50,887][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:57:00,880][__main__][INFO] - Number of regex retries in iteration 456: 0 [2025-11-13 10:57:00,880][__main__][INFO] - agents played in iteration 456 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:57:01,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:01,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:01,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:01,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:01,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:57:01,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:57:02,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:57:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:57:02,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:57:03,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:57:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:57:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:57:04,126][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:57:04,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:57:04,781][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:57:05,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:57:05,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:57:05,761][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:57:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:57:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:57:06,744][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:57:07,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:57:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:57:07,710][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:57:08,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:57:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:57:08,684][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:57:09,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:57:09,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:57:09,659][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:57:09,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:57:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:57:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:57:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:57:11,287][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:57:11,613][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:57:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:57:12,268][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:57:12,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:57:13,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:57:14,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:57:14,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:57:14,125][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:57:15,254][__main__][INFO] - Iteration 457 took 24s (41.01% Gen, 54.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 59s. Estimated total time: 20h 18m 26s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 4s. [2025-11-13 10:57:15,256][__main__][INFO] - Starting iteration 457. [2025-11-13 10:57:15,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:57:15,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:57:25,832][__main__][INFO] - Number of regex retries in iteration 457: 0 [2025-11-13 10:57:25,832][__main__][INFO] - agents played in iteration 457 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:57:26,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:26,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:26,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:26,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:26,403][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:57:26,404][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:57:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:57:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:57:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:57:28,072][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:57:28,400][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:57:28,726][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:57:29,051][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:57:29,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:57:29,707][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:57:30,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:57:30,365][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:57:30,682][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:57:31,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:57:31,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:57:31,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:57:31,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:57:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:57:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:57:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:57:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:57:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:57:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:57:34,280][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:57:34,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:57:34,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:57:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:57:35,581][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:57:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:57:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:57:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:57:36,889][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:57:37,218][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:57:37,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:57:38,327][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:57:39,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:57:39,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:57:39,127][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:57:40,037][__main__][INFO] - Iteration 458 took 24s (42.66% Gen, 53.66% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 45m 4s. Estimated total time: 20h 38m 55s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 17s, 500 more iterations: 3h 26m 29s. [2025-11-13 10:57:40,039][__main__][INFO] - Starting iteration 458. [2025-11-13 10:57:40,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:57:40,042][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:57:49,233][__main__][INFO] - Number of regex retries in iteration 458: 0 [2025-11-13 10:57:49,234][__main__][INFO] - agents played in iteration 458 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:57:49,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:49,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:49,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:49,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:49,790][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:57:49,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:57:50,548][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:57:50,845][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:57:51,173][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:57:51,504][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:57:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:57:52,163][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:57:52,494][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:57:52,826][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:57:53,150][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:57:53,475][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:57:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:57:54,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:57:54,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:57:54,780][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:57:55,104][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:57:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:57:55,753][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:57:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:57:56,402][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:57:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:57:57,051][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:57:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:57:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:57:58,025][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:57:58,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:57:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:57:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:57:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:57:59,659][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:57:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:00,316][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:00,643][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:00,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:58:01,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:02,443][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:02,445][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:02,447][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:03,379][__main__][INFO] - Iteration 459 took 23s (39.38% Gen, 56.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 32m 39s. Estimated total time: 19h 26m 54s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 29s. [2025-11-13 10:58:03,382][__main__][INFO] - Starting iteration 459. [2025-11-13 10:58:03,385][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:58:03,385][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:58:13,922][__main__][INFO] - Number of regex retries in iteration 459: 0 [2025-11-13 10:58:13,923][__main__][INFO] - agents played in iteration 459 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:58:14,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:14,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:14,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:14,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:14,488][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:58:14,488][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:58:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:58:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:58:15,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:58:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:58:16,480][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:58:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:58:17,130][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:58:17,454][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:58:17,779][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:18,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:18,427][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:18,750][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:19,075][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:19,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:20,048][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:20,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:20,700][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:21,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:21,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:58:21,675][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:58:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:22,322][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:22,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:22,976][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:23,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:23,623][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:23,948][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:58:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:58:24,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:24,924][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:25,250][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:25,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:58:26,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:27,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:27,041][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:27,043][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:27,948][__main__][INFO] - Iteration 460 took 24s (42.90% Gen, 53.41% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 33m 33s. Estimated total time: 20h 28m 12s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 56s, 500 more iterations: 3h 24m 42s. [2025-11-13 10:58:27,958][__main__][INFO] - Starting iteration 460. [2025-11-13 10:58:27,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. [2025-11-13 10:58:27,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:58:38,048][__main__][INFO] - Number of regex retries in iteration 460: 0 [2025-11-13 10:58:38,049][__main__][INFO] - agents played in iteration 460 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:58:38,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:38,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:38,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:38,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:38,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:58:38,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:58:39,353][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:58:39,648][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:58:39,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:58:40,299][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:58:40,624][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:58:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:58:41,274][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:58:41,599][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:58:41,923][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:42,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:42,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:43,220][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:43,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:44,193][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:44,518][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:44,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:45,167][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:45,492][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:58:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:58:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:46,465][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:46,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:58:48,432][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:58:48,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:49,083][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:49,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:58:50,519][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:51,266][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:51,267][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:51,269][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:53,074][__main__][INFO] - Iteration 461 took 25s (40.17% Gen, 52.64% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 0m 36s. Estimated total time: 20h 55m 40s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 51s, 500 more iterations: 3h 29m 16s. [2025-11-13 10:58:53,076][__main__][INFO] - Starting iteration 461. [2025-11-13 10:58:53,079][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 10:58:53,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:59:03,704][__main__][INFO] - Number of regex retries in iteration 461: 0 [2025-11-13 10:59:03,704][__main__][INFO] - agents played in iteration 461 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:59:04,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:04,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:04,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:04,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:04,283][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:59:04,284][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:59:05,018][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:59:05,314][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:59:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:59:05,964][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:59:06,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:59:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:59:06,939][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:59:07,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:59:07,588][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:59:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:59:08,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:59:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:59:08,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:59:09,214][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:59:09,540][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:59:09,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:59:10,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:59:10,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:59:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:59:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:59:11,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:59:11,814][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:59:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:59:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:59:12,789][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:59:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:59:13,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:59:13,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:59:14,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:59:14,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:59:14,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:59:15,079][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:59:15,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:59:16,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:59:16,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:59:16,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:59:16,904][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:59:17,848][__main__][INFO] - Iteration 462 took 24s (42.89% Gen, 53.29% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 43m 3s. Estimated total time: 20h 38m 32s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 17s, 500 more iterations: 3h 26m 25s. [2025-11-13 10:59:17,851][__main__][INFO] - Starting iteration 462. [2025-11-13 10:59:17,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 10:59:17,854][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:59:27,894][__main__][INFO] - Number of regex retries in iteration 462: 0 [2025-11-13 10:59:27,895][__main__][INFO] - agents played in iteration 462 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:59:28,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:28,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:28,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:28,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:28,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:59:28,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:59:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:59:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:59:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:59:30,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:59:30,510][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:59:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:59:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:59:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:59:31,813][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:59:32,137][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:59:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:59:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:59:33,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:59:33,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:59:33,762][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:59:34,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:59:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:59:34,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:59:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:59:35,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:59:35,709][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:59:36,034][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:59:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:59:36,692][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:59:37,017][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:59:37,341][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:59:37,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:59:37,996][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:59:38,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:59:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:59:38,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:59:39,300][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:59:39,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 10:59:40,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:59:41,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:59:41,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:59:41,161][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:59:42,184][__main__][INFO] - Iteration 463 took 24s (41.27% Gen, 54.52% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 20m 40s. Estimated total time: 20h 16m 33s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 33s, 500 more iterations: 3h 22m 45s. [2025-11-13 10:59:42,186][__main__][INFO] - Starting iteration 463. [2025-11-13 10:59:42,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 10:59:42,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:59:52,278][__main__][INFO] - Number of regex retries in iteration 463: 0 [2025-11-13 10:59:52,278][__main__][INFO] - agents played in iteration 463 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 10:59:52,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:52,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:52,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:52,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:52,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:59:52,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 10:59:53,624][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:59:53,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:59:54,246][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:59:54,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:59:54,893][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:59:55,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:59:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:59:55,869][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:59:56,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:59:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:59:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:59:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:59:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:59:57,817][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:59:58,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:59:58,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:59:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:59:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:59:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:59:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:00:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:00:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:00:00,749][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:00:01,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:00:01,401][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:00:01,726][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:00:02,058][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:00:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:00:02,706][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:00:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:00:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:00:03,681][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:00:04,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:00:04,766][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:00:05,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:00:05,504][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:00:05,506][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:00:06,556][__main__][INFO] - Iteration 464 took 24s (41.40% Gen, 54.28% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 22m 6s. Estimated total time: 20h 18m 23s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 3s. [2025-11-13 11:00:06,559][__main__][INFO] - Starting iteration 464. [2025-11-13 11:00:06,562][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 11:00:06,562][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:00:17,059][__main__][INFO] - Number of regex retries in iteration 464: 0 [2025-11-13 11:00:17,059][__main__][INFO] - agents played in iteration 464 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:00:17,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:17,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:17,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:17,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:17,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:00:17,664][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:00:18,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:00:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:00:19,079][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:00:19,404][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:00:19,730][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:00:20,057][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:00:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:00:20,714][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:00:21,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:00:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:00:21,694][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:00:22,017][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:00:22,344][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:00:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:00:22,999][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:00:23,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:00:23,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:00:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:00:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:00:24,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:00:24,954][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:00:25,279][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:00:25,608][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:00:25,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:00:26,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:00:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:00:26,904][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:00:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:00:27,553][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:00:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:00:28,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:00:28,530][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:00:28,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:00:29,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:00:30,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:00:30,369][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:00:30,371][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:00:31,357][__main__][INFO] - Iteration 465 took 24s (42.33% Gen, 53.69% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 43m 6s. Estimated total time: 20h 39m 48s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 19s, 500 more iterations: 3h 26m 38s. [2025-11-13 11:00:31,359][__main__][INFO] - Starting iteration 465. [2025-11-13 11:00:31,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 11:00:31,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:00:41,159][__main__][INFO] - Number of regex retries in iteration 465: 0 [2025-11-13 11:00:41,160][__main__][INFO] - agents played in iteration 465 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:00:41,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:41,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:41,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:41,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:41,729][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:00:41,729][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:00:42,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:00:42,814][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:00:43,144][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:00:43,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:00:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:00:44,122][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:00:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:00:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:00:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:00:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:00:45,749][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:00:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:00:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:00:46,725][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:00:47,050][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:00:47,375][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:00:47,707][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:00:48,025][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:00:48,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:00:48,675][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:00:49,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:00:49,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:00:49,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:00:49,974][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:00:50,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:00:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:00:50,947][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:00:51,271][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:00:51,598][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:00:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:00:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:00:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:00:52,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:00:53,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:00:54,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:00:54,390][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:00:54,392][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:00:55,446][__main__][INFO] - Iteration 466 took 24s (40.68% Gen, 54.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 7m 6s. Estimated total time: 20h 4m 13s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 8s, 500 more iterations: 3h 20m 42s. [2025-11-13 11:00:55,448][__main__][INFO] - Starting iteration 466. [2025-11-13 11:00:55,452][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 11:00:55,452][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:01:05,297][__main__][INFO] - Number of regex retries in iteration 466: 0 [2025-11-13 11:01:05,298][__main__][INFO] - agents played in iteration 466 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:01:05,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:05,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:05,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:05,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:05,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:01:05,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:01:06,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:01:06,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:01:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:01:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:01:07,937][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:01:08,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:01:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:01:08,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:01:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:01:09,564][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:01:09,890][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:01:10,215][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:01:10,540][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:01:10,865][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:01:11,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:01:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:01:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:01:12,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:01:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:01:12,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:01:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:01:13,466][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:01:13,790][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:01:14,115][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:01:14,439][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:01:14,765][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:01:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:01:15,413][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:01:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:01:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:01:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:01:16,711][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:01:17,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:01:17,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:01:18,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:01:18,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:01:18,571][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:01:19,598][__main__][INFO] - Iteration 467 took 24s (40.77% Gen, 54.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 49s. Estimated total time: 20h 7m 20s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 14s, 500 more iterations: 3h 21m 13s. [2025-11-13 11:01:19,600][__main__][INFO] - Starting iteration 467. [2025-11-13 11:01:19,603][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 11:01:19,603][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:01:29,153][__main__][INFO] - Number of regex retries in iteration 467: 0 [2025-11-13 11:01:29,153][__main__][INFO] - agents played in iteration 467 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:01:29,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:29,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:29,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:29,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:29,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:01:29,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:01:30,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:01:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:01:31,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:01:31,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:01:31,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:01:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:01:32,514][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:01:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:01:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:01:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:01:33,813][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:01:34,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:01:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:01:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:01:35,115][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:01:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:01:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:01:36,090][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:01:36,416][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:01:36,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:01:37,071][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:01:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:01:37,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:01:38,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:01:38,373][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:01:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:01:39,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:01:39,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:01:39,671][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:01:39,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:01:40,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:01:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:01:40,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:01:41,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:01:42,486][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:01:42,487][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:01:42,489][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:01:43,485][__main__][INFO] - Iteration 468 took 23s (39.99% Gen, 55.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 14s. Estimated total time: 19h 54m 9s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 1s. [2025-11-13 11:01:43,487][__main__][INFO] - Starting iteration 468. [2025-11-13 11:01:43,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 11:01:43,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:01:53,720][__main__][INFO] - Number of regex retries in iteration 468: 0 [2025-11-13 11:01:53,721][__main__][INFO] - agents played in iteration 468 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:01:54,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:54,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:54,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:54,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:54,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:01:54,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:01:55,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:01:55,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:01:55,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:01:56,034][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:01:56,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:01:56,685][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:01:57,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:01:57,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:01:57,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:01:57,985][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:01:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:01:58,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:01:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:01:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:01:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:01:59,936][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:02:00,263][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:02:00,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:02:00,917][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:02:01,246][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:02:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:02:01,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:02:02,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:02:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:02:02,875][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:02:03,199][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:02:03,524][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:02:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:02:04,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:02:04,501][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:02:04,825][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:02:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:02:05,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:02:06,241][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:02:06,997][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:02:06,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:02:07,000][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:02:07,987][__main__][INFO] - Iteration 469 took 24s (41.58% Gen, 54.21% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 26m 34s. Estimated total time: 20h 24m 53s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 49s, 500 more iterations: 3h 24m 8s. [2025-11-13 11:02:07,990][__main__][INFO] - Starting iteration 469. [2025-11-13 11:02:07,993][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 11:02:07,993][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:02:17,450][__main__][INFO] - Number of regex retries in iteration 469: 0 [2025-11-13 11:02:17,451][__main__][INFO] - agents played in iteration 469 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:02:17,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:17,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:18,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:18,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:18,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:02:18,039][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:02:18,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:02:19,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:02:19,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:02:19,754][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:02:20,082][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:02:20,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:02:20,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:02:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:02:21,397][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:02:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:02:22,050][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:02:22,376][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:02:22,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:02:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:02:23,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:02:23,676][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:02:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:02:24,329][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:02:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:02:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:02:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:02:25,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:02:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:02:26,284][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:02:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:02:26,939][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:02:27,265][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:02:27,593][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:02:27,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:02:28,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:02:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:02:28,895][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:02:29,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:02:29,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:02:30,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:02:30,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:02:30,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:02:31,746][__main__][INFO] - Iteration 470 took 23s (39.82% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 59s. Estimated total time: 19h 47m 42s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 57s. [2025-11-13 11:02:31,749][__main__][INFO] - Starting iteration 470. [2025-11-13 11:02:31,753][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. [2025-11-13 11:02:31,753][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:02:42,103][__main__][INFO] - Number of regex retries in iteration 470: 0 [2025-11-13 11:02:42,104][__main__][INFO] - agents played in iteration 470 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:02:42,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:42,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:42,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:42,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:02:42,684][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:02:42,684][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:02:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:02:43,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:02:44,101][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:02:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:02:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:02:45,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:02:45,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:02:45,730][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:02:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:02:46,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:02:46,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:02:47,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:02:47,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:02:47,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:02:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:02:48,344][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:02:48,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:02:48,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:02:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:02:49,645][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:02:49,972][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:02:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:02:50,621][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:02:50,948][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:02:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:02:51,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:02:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:02:52,246][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:02:52,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:02:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:02:53,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:02:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:02:53,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:02:54,574][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:02:55,314][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:02:55,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:02:55,317][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:02:57,270][__main__][INFO] - Iteration 471 took 25s (40.56% Gen, 51.78% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 16m 47s. Estimated total time: 21h 15m 56s. Time estimates for 10 more iterations: 4m 15s, 100 more iterations: 42m 31s, 500 more iterations: 3h 32m 39s. [2025-11-13 11:02:57,273][__main__][INFO] - Starting iteration 471. [2025-11-13 11:02:57,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:02:57,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:03:07,169][__main__][INFO] - Number of regex retries in iteration 471: 0 [2025-11-13 11:03:07,170][__main__][INFO] - agents played in iteration 471 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:03:07,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:07,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:07,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:07,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:07,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:03:07,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:03:08,561][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:03:08,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:03:09,184][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:03:09,518][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:03:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:03:10,167][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:03:10,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:03:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:03:11,152][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:03:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:03:11,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:03:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:03:12,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:03:12,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:03:13,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:03:13,432][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:03:13,756][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:03:14,081][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:03:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:03:14,730][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:03:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:03:15,381][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:03:15,708][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:03:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:03:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:03:16,687][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:03:17,012][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:03:17,339][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:03:17,665][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:03:17,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:03:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:03:18,638][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:03:18,965][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:03:19,746][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:03:20,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:03:20,524][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:03:20,526][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:03:21,536][__main__][INFO] - Iteration 472 took 24s (40.78% Gen, 55.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 28s. Estimated total time: 20h 13m 0s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 26s, 500 more iterations: 3h 22m 10s. [2025-11-13 11:03:21,538][__main__][INFO] - Starting iteration 472. [2025-11-13 11:03:21,541][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:03:21,541][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:03:31,208][__main__][INFO] - Number of regex retries in iteration 472: 0 [2025-11-13 11:03:31,209][__main__][INFO] - agents played in iteration 472 are Alice, Bob, Bob_buffer, Alice_buffer [2025-11-13 11:03:31,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:31,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:31,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:31,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:03:31,795][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:03:31,796][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:03:32,573][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:03:32,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:03:33,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:03:33,527][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:03:33,852][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:03:34,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:03:34,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:03:34,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:03:35,151][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:03:35,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:03:35,804][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:03:36,130][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:03:36,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:03:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:03:37,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:03:37,438][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:03:37,762][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:03:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:03:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:03:38,750][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:03:39,077][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:03:39,405][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:03:39,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:03:40,062][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:03:40,393][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:03:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:03:41,043][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:03:41,369][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:03:41,699][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:03:42,032][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:03:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:03:42,678][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:03:43,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:03:43,736][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:03:44,469][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:03:44,471][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:03:44,473][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:03:45,456][__main__][INFO] - Iteration 473 took 23s (40.42% Gen, 55.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 55m 51s. Estimated total time: 19h 55m 47s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 17s. [2025-11-13 11:03:45,458][__main__][INFO] - Starting iteration 473. [2025-11-13 11:03:45,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:03:45,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:06:50,937][mllm.models.large_language_model_local][INFO] - Loaded 47 past agent adapters from checkpoints directory. [2025-11-13 11:07:09,910][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'. [2025-11-13 11:07:11,106][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'. [2025-11-13 11:07:11,114][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'. [2025-11-13 11:07:12,614][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'. [2025-11-13 11:09:20,826][mllm.training.trainer_common][INFO] - Loading trainer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:09:20,829][mllm.training.trainer_common][INFO] - Loading policy optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:09:21,671][mllm.training.trainer_common][INFO] - Loading critic optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:09:21,674][__main__][INFO] - Starting iteration 473. [2025-11-13 11:09:21,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:09:21,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:09:48,544][__main__][INFO] - Number of regex retries in iteration 473: 0 [2025-11-13 11:09:48,545][__main__][INFO] - agents played in iteration 473 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:09:48,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:49,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:49,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:49,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:49,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:09:49,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:09:49,722][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:09:50,350][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:09:50,676][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:09:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:09:51,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:09:51,655][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:09:51,980][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:09:52,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:09:52,633][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:09:52,960][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:09:53,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:09:53,614][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:09:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:09:54,268][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:09:54,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:09:54,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:09:55,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:09:55,572][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:09:55,904][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:09:56,228][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:09:56,555][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:09:56,879][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:09:57,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:09:57,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:09:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:09:58,178][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:09:58,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:09:58,826][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:09:59,150][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:09:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:09:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:10:00,132][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:10:00,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:10:01,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.78%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:10:02,110][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:10:02,112][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:10:02,114][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:10:03,148][__main__][INFO] - Iteration 474 took 41s (64.78% Gen, 32.72% Train). Generation: 26s, Training: 13s. Estimated remaining time: 34h 30m 18s. Estimated total time: 34h 33m 32s. Time estimates for 10 more iterations: 6m 54s, 100 more iterations: 1h 9m 7s, 500 more iterations: 5h 45m 35s. [2025-11-13 11:10:03,151][__main__][INFO] - Starting iteration 474. [2025-11-13 11:10:03,155][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:10:03,155][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:10:20,664][__main__][INFO] - Number of regex retries in iteration 474: 0 [2025-11-13 11:10:20,665][__main__][INFO] - agents played in iteration 474 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:10:21,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:21,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:21,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:21,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:21,225][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:10:21,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:10:21,905][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:10:22,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:10:22,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:10:22,852][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:10:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:10:23,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:10:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:10:24,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:10:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:10:24,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:10:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:10:25,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:10:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:10:26,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:10:26,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:10:26,752][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:10:27,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:10:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:10:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:10:28,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:10:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:10:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:10:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:10:29,346][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:10:29,671][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:10:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:10:30,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:10:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:10:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:10:31,291][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:10:31,616][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:10:31,942][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:10:32,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:10:32,948][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:10:33,673][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:10:33,675][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:10:33,677][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:10:34,534][__main__][INFO] - Iteration 475 took 31s (55.79% Gen, 41.47% Train). Generation: 17s, Training: 13s. Estimated remaining time: 26h 5m 15s. Estimated total time: 26h 9m 0s. Time estimates for 10 more iterations: 5m 13s, 100 more iterations: 52m 18s, 500 more iterations: 4h 21m 30s. [2025-11-13 11:10:34,536][__main__][INFO] - Starting iteration 475. [2025-11-13 11:10:34,540][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:10:34,540][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:10:47,072][__main__][INFO] - Number of regex retries in iteration 475: 0 [2025-11-13 11:10:47,073][__main__][INFO] - agents played in iteration 475 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:10:47,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:47,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:47,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:47,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:47,621][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:10:47,621][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:10:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:10:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:10:48,941][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:10:49,266][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:10:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:10:49,921][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:10:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:10:50,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:10:50,898][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:10:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:10:51,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:10:51,878][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:10:52,206][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:10:52,531][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:10:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:10:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:10:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:10:53,838][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:10:54,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:10:54,490][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:10:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:10:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:10:55,462][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:10:55,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:10:56,117][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:10:56,441][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:10:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:10:57,099][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:10:57,423][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:10:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:10:58,079][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:10:58,406][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:10:58,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:10:59,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:00,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:00,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:00,151][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:00,973][__main__][INFO] - Iteration 476 took 26s (47.41% Gen, 49.47% Train). Generation: 12s, Training: 13s. Estimated remaining time: 21h 57m 31s. Estimated total time: 22h 1m 43s. Time estimates for 10 more iterations: 4m 24s, 100 more iterations: 44m 3s, 500 more iterations: 3h 40m 17s. [2025-11-13 11:11:00,975][__main__][INFO] - Starting iteration 476. [2025-11-13 11:11:00,977][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:11:00,978][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:11:11,974][__main__][INFO] - Number of regex retries in iteration 476: 0 [2025-11-13 11:11:11,975][__main__][INFO] - agents played in iteration 476 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:11:12,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:12,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:12,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:12,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:12,526][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:11:12,526][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:11:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:11:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:11:13,846][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:11:14,174][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:11:14,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:11:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:11:15,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:11:15,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:11:15,806][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:11:16,133][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:11:16,465][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:11:16,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:11:17,114][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:11:17,442][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:11:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:11:18,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:11:18,418][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:11:18,743][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:11:19,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:11:19,397][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:11:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:11:20,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:11:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:11:20,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:11:21,023][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:11:21,349][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:11:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:22,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:22,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:22,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:23,302][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:23,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:11:24,304][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:25,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:25,040][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:25,042][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:25,968][__main__][INFO] - Iteration 477 took 24s (44.00% Gen, 52.29% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 44m 58s. Estimated total time: 20h 49m 35s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 39s, 500 more iterations: 3h 28m 15s. [2025-11-13 11:11:25,971][__main__][INFO] - Starting iteration 477. [2025-11-13 11:11:25,974][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:11:25,974][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:11:36,944][__main__][INFO] - Number of regex retries in iteration 477: 0 [2025-11-13 11:11:36,944][__main__][INFO] - agents played in iteration 477 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:11:37,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:37,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:37,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:37,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:37,495][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:11:37,495][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:11:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:11:38,481][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:11:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:11:39,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:11:39,460][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:11:39,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:11:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:11:40,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:11:40,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:11:41,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:11:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:11:41,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:11:42,059][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:11:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:11:42,711][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:11:43,035][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:11:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:11:43,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:11:44,011][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:11:44,337][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:11:44,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:11:44,991][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:11:45,315][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:11:45,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:11:45,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:11:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:11:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:46,943][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:47,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:48,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:11:49,247][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:49,988][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:49,990][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:49,992][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:50,863][__main__][INFO] - Iteration 478 took 24s (44.07% Gen, 52.42% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 39m 28s. Estimated total time: 20h 44m 30s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 29s, 500 more iterations: 3h 27m 25s. [2025-11-13 11:11:50,865][__main__][INFO] - Starting iteration 478. [2025-11-13 11:11:50,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:11:50,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:12:01,141][__main__][INFO] - Number of regex retries in iteration 478: 0 [2025-11-13 11:12:01,142][__main__][INFO] - agents played in iteration 478 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:12:01,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:01,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:01,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:01,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:01,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:12:01,699][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:12:02,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:12:02,697][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:12:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:12:03,351][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:12:03,678][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:12:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:12:04,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:12:04,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:12:04,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:12:05,309][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:12:05,640][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:12:05,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:12:06,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:12:06,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:12:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:12:07,265][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:12:07,590][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:12:07,915][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:12:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:12:08,565][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:12:08,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:12:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:12:09,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:12:09,870][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:12:10,197][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:12:10,521][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:12:10,848][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:12:11,172][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:12:11,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:12:11,821][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:12:12,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:12:12,470][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:12:12,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:12:13,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:12:14,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:12:14,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:12:14,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:12:15,206][__main__][INFO] - Iteration 479 took 24s (42.20% Gen, 53.69% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 11m 31s. Estimated total time: 20h 16m 57s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 33s, 500 more iterations: 3h 22m 49s. [2025-11-13 11:12:15,208][__main__][INFO] - Starting iteration 479. [2025-11-13 11:12:15,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:12:15,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:12:25,356][__main__][INFO] - Number of regex retries in iteration 479: 0 [2025-11-13 11:12:25,357][__main__][INFO] - agents played in iteration 479 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:12:25,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:25,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:25,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:25,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:25,910][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:12:25,910][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:12:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:12:26,911][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:12:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:12:27,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:12:27,892][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:12:28,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:12:28,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:12:28,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:12:29,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:12:29,532][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:12:29,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:12:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:12:30,507][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:12:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:12:31,166][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:12:31,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:12:31,816][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:12:32,144][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:12:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:12:32,797][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:12:33,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:12:33,457][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:12:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:12:34,118][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:12:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:12:34,770][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:12:35,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:12:35,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:12:35,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:12:36,072][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:12:36,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:12:36,725][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:12:37,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:12:37,721][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:12:38,479][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:12:38,481][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:12:38,482][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:12:39,373][__main__][INFO] - Iteration 480 took 24s (41.98% Gen, 54.32% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 2m 14s. Estimated total time: 20h 8m 4s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 16s, 500 more iterations: 3h 21m 20s. [2025-11-13 11:12:39,375][__main__][INFO] - Starting iteration 480. [2025-11-13 11:12:39,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. [2025-11-13 11:12:39,379][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:12:49,091][__main__][INFO] - Number of regex retries in iteration 480: 0 [2025-11-13 11:12:49,092][__main__][INFO] - agents played in iteration 480 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:12:49,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:49,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:49,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:49,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:49,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:12:49,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:12:50,346][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:12:50,644][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:12:50,972][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:12:51,299][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:12:51,627][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:12:51,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:12:52,280][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:12:52,608][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:12:52,937][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:12:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:12:53,593][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:12:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:12:54,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:12:54,573][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:12:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:12:55,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:12:55,552][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:12:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:12:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:12:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:12:56,855][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:12:57,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:12:57,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:12:57,830][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:12:58,157][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:12:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:12:58,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:12:59,136][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:12:59,461][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:12:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:13:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:13:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:13:00,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:13:01,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:13:02,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:13:02,206][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:13:02,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:13:03,950][__main__][INFO] - Iteration 481 took 24s (39.53% Gen, 53.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 22m 24s. Estimated total time: 20h 28m 38s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 57s, 500 more iterations: 3h 24m 46s. [2025-11-13 11:13:03,953][__main__][INFO] - Starting iteration 481. [2025-11-13 11:13:03,955][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:13:03,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:13:14,598][__main__][INFO] - Number of regex retries in iteration 481: 0 [2025-11-13 11:13:14,599][__main__][INFO] - agents played in iteration 481 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:13:15,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:15,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:15,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:15,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:15,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:13:15,140][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:13:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:13:16,108][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:13:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:13:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:13:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:13:17,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:13:17,745][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:13:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:13:18,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:13:18,726][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:13:19,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:13:19,376][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:13:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:13:20,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:13:20,354][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:13:20,682][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:13:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:13:21,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:13:21,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:13:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:13:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:13:22,635][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:13:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:13:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:13:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:13:23,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:13:24,269][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:13:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:13:24,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:13:25,248][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:13:25,579][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:13:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:13:26,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:13:26,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:13:27,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:13:27,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:13:27,607][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:13:28,438][__main__][INFO] - Iteration 482 took 24s (43.47% Gen, 53.13% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 17m 33s. Estimated total time: 20h 24m 13s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 48s, 500 more iterations: 3h 24m 2s. [2025-11-13 11:13:28,441][__main__][INFO] - Starting iteration 482. [2025-11-13 11:13:28,444][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:13:28,445][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:13:38,025][__main__][INFO] - Number of regex retries in iteration 482: 0 [2025-11-13 11:13:38,025][__main__][INFO] - agents played in iteration 482 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:13:38,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:38,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:38,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:38,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:13:38,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:13:38,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:13:39,258][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:13:39,556][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:13:39,883][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:13:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:13:40,540][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:13:40,870][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:13:41,197][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:13:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:13:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:13:42,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:13:42,519][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:13:42,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:13:43,172][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:13:43,501][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:13:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:13:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:13:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:13:44,806][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:13:45,133][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:13:45,457][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:13:45,782][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:13:46,107][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:13:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:13:46,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:13:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:13:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:13:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:13:48,068][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:13:48,394][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:13:48,721][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:13:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:13:49,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:13:49,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:13:50,391][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:13:51,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:13:51,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:13:51,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:13:52,192][__main__][INFO] - Iteration 483 took 23s (40.34% Gen, 55.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 40m 25s. Estimated total time: 19h 47m 29s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 54s. [2025-11-13 11:13:52,195][__main__][INFO] - Starting iteration 483. [2025-11-13 11:13:52,198][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:13:52,198][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:14:02,313][__main__][INFO] - Number of regex retries in iteration 483: 0 [2025-11-13 11:14:02,313][__main__][INFO] - agents played in iteration 483 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:14:02,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:02,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:02,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:02,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:02,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:14:02,869][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:14:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:14:03,859][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:14:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:14:04,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:14:04,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:14:05,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:14:05,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:14:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:14:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:14:06,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:14:06,812][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:14:07,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:14:07,464][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:14:07,791][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:14:08,116][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:14:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:14:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:14:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:14:09,422][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:14:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:14:10,079][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:14:10,405][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:14:10,731][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:14:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:14:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:14:11,712][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:14:12,037][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:14:12,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:14:12,687][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:14:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:14:13,340][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:14:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:14:13,994][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:14:14,668][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:14:15,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:14:15,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:14:15,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:14:16,306][__main__][INFO] - Iteration 484 took 24s (41.95% Gen, 54.32% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 58m 2s. Estimated total time: 20h 5m 29s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 10s, 500 more iterations: 3h 20m 54s. [2025-11-13 11:14:16,309][__main__][INFO] - Starting iteration 484. [2025-11-13 11:14:16,313][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:14:16,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:14:25,492][__main__][INFO] - Number of regex retries in iteration 484: 0 [2025-11-13 11:14:25,493][__main__][INFO] - agents played in iteration 484 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:14:25,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:25,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:26,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:26,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:26,046][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:14:26,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:14:26,729][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:14:27,027][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:14:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:14:27,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:14:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:14:28,335][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:14:28,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:14:28,990][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:14:29,317][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:14:29,646][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:14:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:14:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:14:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:14:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:14:31,281][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:14:31,608][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:14:31,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:14:32,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:14:32,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:14:32,915][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:14:33,242][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:14:33,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:14:33,891][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:14:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:14:34,544][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:14:34,869][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:14:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:14:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:14:35,847][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:14:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:14:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:14:36,830][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:14:37,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:14:37,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:14:38,564][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:14:38,565][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:14:38,567][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:14:39,496][__main__][INFO] - Iteration 485 took 23s (39.59% Gen, 56.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 11m 25s. Estimated total time: 19h 19m 16s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 12s. [2025-11-13 11:14:39,498][__main__][INFO] - Starting iteration 485. [2025-11-13 11:14:39,501][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:14:39,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:14:49,375][__main__][INFO] - Number of regex retries in iteration 485: 0 [2025-11-13 11:14:49,376][__main__][INFO] - agents played in iteration 485 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:14:49,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:49,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:49,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:49,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:14:49,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:14:49,933][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:14:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:14:50,924][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:14:51,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:14:51,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:14:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:14:52,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:14:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:14:52,885][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:14:53,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:14:53,539][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:14:53,865][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:14:54,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:14:54,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:14:54,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:14:55,169][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:14:55,499][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:14:55,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:14:56,150][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:14:56,479][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:14:56,812][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:14:57,140][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:14:57,468][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:14:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:14:58,120][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:14:58,445][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:14:58,772][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:14:59,099][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:14:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:14:59,754][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:15:00,080][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:15:00,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:15:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:15:01,061][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:15:01,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:15:02,459][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:15:02,460][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:15:02,462][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:15:03,334][__main__][INFO] - Iteration 486 took 23s (41.43% Gen, 54.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 43m 25s. Estimated total time: 19h 51m 39s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 36s. [2025-11-13 11:15:03,336][__main__][INFO] - Starting iteration 486. [2025-11-13 11:15:03,339][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:15:03,339][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:15:13,238][__main__][INFO] - Number of regex retries in iteration 486: 0 [2025-11-13 11:15:13,238][__main__][INFO] - agents played in iteration 486 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:15:13,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:13,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:13,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:13,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:13,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:15:13,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:15:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:15:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:15:15,123][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:15:15,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:15:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:15:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:15:16,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:15:16,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:15:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:15:17,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:15:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:15:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:15:18,403][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:15:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:15:19,060][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:15:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:15:19,712][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:15:20,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:15:20,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:15:20,695][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:15:21,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:15:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:15:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:15:22,005][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:15:22,333][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:15:22,661][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:15:22,989][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:15:23,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:15:23,650][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:15:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:15:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:15:24,633][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:15:24,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:15:25,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:15:26,367][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:15:26,368][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:15:26,370][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:15:27,238][__main__][INFO] - Iteration 487 took 23s (41.42% Gen, 54.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 46m 22s. Estimated total time: 19h 55m 0s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 10s. [2025-11-13 11:15:27,240][__main__][INFO] - Starting iteration 487. [2025-11-13 11:15:27,245][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:15:27,245][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:15:36,705][__main__][INFO] - Number of regex retries in iteration 487: 0 [2025-11-13 11:15:36,705][__main__][INFO] - agents played in iteration 487 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:15:37,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:37,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:37,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:37,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:37,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:15:37,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:15:37,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:15:38,264][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:15:38,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:15:38,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:15:39,248][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:15:39,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:15:39,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:15:40,228][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:15:40,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:15:40,881][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:15:41,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:15:41,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:15:41,874][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:15:42,203][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:15:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:15:42,854][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:15:43,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:15:43,505][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:15:43,834][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:15:44,160][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:15:44,488][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:15:44,817][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:15:45,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:15:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:15:45,794][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:15:46,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:15:46,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:15:46,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:15:47,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:15:47,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:15:47,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:15:48,073][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:15:48,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:15:49,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:15:49,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:15:49,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:15:49,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:15:50,780][__main__][INFO] - Iteration 488 took 23s (40.19% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 27m 48s. Estimated total time: 19h 36m 50s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 13s, 500 more iterations: 3h 16m 8s. [2025-11-13 11:15:50,782][__main__][INFO] - Starting iteration 488. [2025-11-13 11:15:50,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:15:50,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:15:59,413][__main__][INFO] - Number of regex retries in iteration 488: 0 [2025-11-13 11:15:59,413][__main__][INFO] - agents played in iteration 488 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:15:59,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:59,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:59,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:59,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:15:59,966][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:15:59,967][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:16:00,685][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:16:00,986][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:16:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:16:01,646][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:16:01,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:16:02,299][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:16:02,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:16:02,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:16:03,296][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:16:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:16:03,954][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:16:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:16:04,614][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:16:04,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:16:05,276][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:16:05,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:16:05,931][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:16:06,262][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:16:06,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:16:06,916][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:16:07,243][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:16:07,572][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:16:07,897][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:16:08,229][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:16:08,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:16:08,880][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:16:09,206][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:16:09,531][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:16:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:16:10,181][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:16:10,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:16:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:16:11,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:16:11,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:16:12,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:16:12,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:16:12,538][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:16:13,482][__main__][INFO] - Iteration 489 took 22s (38.01% Gen, 57.83% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 45m 26s. Estimated total time: 18h 54m 51s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 8s. [2025-11-13 11:16:13,484][__main__][INFO] - Starting iteration 489. [2025-11-13 11:16:13,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:16:13,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:16:22,835][__main__][INFO] - Number of regex retries in iteration 489: 0 [2025-11-13 11:16:22,836][__main__][INFO] - agents played in iteration 489 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:16:23,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:23,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:23,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:23,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:23,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:16:23,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:16:24,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:16:24,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:16:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:16:25,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:16:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:16:25,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:16:26,044][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:16:26,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:16:26,697][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:16:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:16:27,358][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:16:27,684][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:16:28,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:16:28,347][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:16:28,676][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:16:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:16:29,331][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:16:29,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:16:29,988][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:16:30,315][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:16:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:16:30,976][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:16:31,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:16:31,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:16:31,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:16:32,281][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:16:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:16:32,943][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:16:33,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:16:33,593][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:16:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:16:34,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:16:34,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:16:35,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:16:35,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:16:35,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:16:35,954][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:16:36,959][__main__][INFO] - Iteration 490 took 23s (39.83% Gen, 55.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 23m 52s. Estimated total time: 19h 33m 40s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 36s. [2025-11-13 11:16:36,961][__main__][INFO] - Starting iteration 490. [2025-11-13 11:16:36,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. [2025-11-13 11:16:36,965][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:16:46,523][__main__][INFO] - Number of regex retries in iteration 490: 0 [2025-11-13 11:16:46,523][__main__][INFO] - agents played in iteration 490 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:16:46,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:47,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:47,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:47,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:47,083][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:16:47,084][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:16:47,814][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:16:48,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:16:48,439][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:16:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:16:49,093][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:16:49,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:16:49,745][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:16:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:16:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:16:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:16:51,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:16:51,378][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:16:51,707][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:16:52,033][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:16:52,360][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:16:52,685][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:16:53,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:16:53,339][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:16:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:16:53,992][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:16:54,319][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:16:54,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:16:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:16:55,295][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:16:55,623][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:16:55,948][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:16:56,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:16:56,597][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:16:56,922][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:16:57,253][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:16:57,578][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:16:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:16:58,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:16:58,881][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:16:59,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:16:59,604][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:16:59,606][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:17:01,457][__main__][INFO] - Iteration 491 took 24s (39.02% Gen, 53.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 14m 25s. Estimated total time: 20h 24m 38s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 49s, 500 more iterations: 3h 24m 6s. [2025-11-13 11:17:01,459][__main__][INFO] - Starting iteration 491. [2025-11-13 11:17:01,462][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:17:01,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:17:11,205][__main__][INFO] - Number of regex retries in iteration 491: 0 [2025-11-13 11:17:11,206][__main__][INFO] - agents played in iteration 491 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:17:11,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:11,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:11,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:11,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:11,770][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:17:11,770][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:17:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:17:12,796][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:17:13,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:17:13,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:17:13,776][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:17:14,102][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:17:14,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:17:14,755][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:17:15,082][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:17:15,409][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:17:15,736][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:17:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:17:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:17:16,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:17:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:17:17,368][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:17:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:17:18,021][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:17:18,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:17:18,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:17:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:17:19,321][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:17:19,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:17:19,979][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:17:20,304][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:17:20,629][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:17:20,954][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:17:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:17:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:17:21,930][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:17:22,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:17:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:17:22,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:17:23,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:17:24,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:17:24,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:17:24,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:17:25,232][__main__][INFO] - Iteration 492 took 23s (40.99% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 37m 58s. Estimated total time: 19h 48m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 5s. [2025-11-13 11:17:25,234][__main__][INFO] - Starting iteration 492. [2025-11-13 11:17:25,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:17:25,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:17:34,335][__main__][INFO] - Number of regex retries in iteration 492: 0 [2025-11-13 11:17:34,336][__main__][INFO] - agents played in iteration 492 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:17:34,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:34,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:34,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:34,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:34,898][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:17:34,899][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:17:35,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:17:35,921][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:17:36,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:17:36,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:17:36,908][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:17:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:17:37,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:17:37,885][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:17:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:17:38,543][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:17:38,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:17:39,198][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:17:39,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:17:39,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:17:40,175][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:17:40,506][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:17:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:17:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:17:41,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:17:41,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:17:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:17:42,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:17:42,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:17:43,120][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:17:43,444][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:17:43,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:17:44,093][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:17:44,421][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:17:44,747][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:17:45,075][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:17:45,400][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:17:45,726][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:17:46,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:17:46,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:17:47,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:17:47,483][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:17:47,484][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:17:48,401][__main__][INFO] - Iteration 493 took 23s (39.27% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 12s. Estimated total time: 19h 18m 11s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 1s. [2025-11-13 11:17:48,403][__main__][INFO] - Starting iteration 493. [2025-11-13 11:17:48,407][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:17:48,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:17:57,500][__main__][INFO] - Number of regex retries in iteration 493: 0 [2025-11-13 11:17:57,500][__main__][INFO] - agents played in iteration 493 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:17:57,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:57,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:58,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:58,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:58,056][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:17:58,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:17:58,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:17:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:17:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:17:59,732][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:18:00,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:18:00,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:18:00,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:18:01,046][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:18:01,374][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:18:01,700][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:18:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:18:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:18:02,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:18:03,019][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:18:03,349][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:18:03,678][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:18:04,010][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:18:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:18:04,669][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:18:04,997][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:18:05,325][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:18:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:18:05,987][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:18:06,315][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:18:06,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:18:06,976][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:18:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:18:07,634][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:18:07,966][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:18:08,294][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:18:08,622][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:18:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:18:09,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:18:09,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:18:10,656][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:18:10,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:18:10,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:18:11,547][__main__][INFO] - Iteration 494 took 23s (39.29% Gen, 56.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 41s. Estimated total time: 19h 17m 4s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 50s. [2025-11-13 11:18:11,549][__main__][INFO] - Starting iteration 494. [2025-11-13 11:18:11,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:18:11,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:18:20,979][__main__][INFO] - Number of regex retries in iteration 494: 0 [2025-11-13 11:18:20,980][__main__][INFO] - agents played in iteration 494 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:18:21,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:21,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:21,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:21,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:21,540][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:18:21,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:18:22,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:18:22,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:18:22,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:18:23,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:18:23,556][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:18:23,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:18:24,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:18:24,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:18:24,873][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:18:25,201][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:18:25,531][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:18:25,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:18:26,194][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:18:26,524][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:18:26,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:18:27,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:18:27,523][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:18:27,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:18:28,187][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:18:28,518][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:18:28,851][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:18:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:18:29,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:18:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:18:30,169][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:18:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:18:30,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:18:31,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:18:31,473][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:18:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:18:32,131][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:18:32,458][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:18:32,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:18:33,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:18:34,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:18:34,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:18:34,179][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:18:35,107][__main__][INFO] - Iteration 495 took 23s (39.90% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 25m 59s. Estimated total time: 19h 37m 45s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 17s. [2025-11-13 11:18:35,109][__main__][INFO] - Starting iteration 495. [2025-11-13 11:18:35,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:18:35,113][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:18:44,552][__main__][INFO] - Number of regex retries in iteration 495: 0 [2025-11-13 11:18:44,552][__main__][INFO] - agents played in iteration 495 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:18:44,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:45,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:45,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:45,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:45,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:18:45,115][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:18:45,831][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:18:46,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:18:46,458][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:18:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:18:47,117][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:18:47,449][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:18:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:18:48,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:18:48,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:18:48,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:18:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:18:49,415][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:18:49,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:18:50,074][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:18:50,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:18:50,730][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:18:51,057][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:18:51,383][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:18:51,711][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:18:52,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:18:52,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:18:52,690][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:18:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:18:53,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:18:53,678][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:18:54,003][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:18:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:18:54,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:18:54,994][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:18:55,319][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:18:55,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:18:55,973][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:18:56,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:18:57,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:18:57,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:18:57,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:18:57,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:18:58,652][__main__][INFO] - Iteration 496 took 23s (40.10% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 24m 52s. Estimated total time: 19h 37m 2s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 10s. [2025-11-13 11:18:58,654][__main__][INFO] - Starting iteration 496. [2025-11-13 11:18:58,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:18:58,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:19:07,531][__main__][INFO] - Number of regex retries in iteration 496: 0 [2025-11-13 11:19:07,531][__main__][INFO] - agents played in iteration 496 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:19:07,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:08,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:08,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:08,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:08,093][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:19:08,094][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:19:08,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:19:09,120][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:19:09,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:19:09,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:19:10,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:19:10,431][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:19:10,765][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:19:11,094][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:19:11,423][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:19:11,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:19:12,082][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:19:12,416][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:19:12,742][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:19:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:19:13,397][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:19:13,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:19:14,057][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:19:14,385][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:19:14,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:19:15,035][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:19:15,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:19:15,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:19:16,011][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:19:16,342][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:19:16,675][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:19:17,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:19:17,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:19:17,652][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:19:17,977][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:19:18,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:19:18,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:19:18,958][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:19:19,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:19:19,989][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:19:20,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:19:20,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:19:20,734][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:19:21,717][__main__][INFO] - Iteration 497 took 23s (38.47% Gen, 57.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 0m 28s. Estimated total time: 19h 13m 0s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 10s. [2025-11-13 11:19:21,719][__main__][INFO] - Starting iteration 497. [2025-11-13 11:19:21,723][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:19:21,723][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:19:31,109][__main__][INFO] - Number of regex retries in iteration 497: 0 [2025-11-13 11:19:31,110][__main__][INFO] - agents played in iteration 497 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:19:31,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:31,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:31,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:31,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:31,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:19:31,668][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:19:32,392][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:19:32,690][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:19:33,018][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:19:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:19:33,674][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:19:34,007][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:19:34,334][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:19:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:19:34,990][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:19:35,319][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:19:35,645][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:19:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:19:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:19:36,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:19:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:19:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:19:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:19:37,944][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:19:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:19:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:19:38,930][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:19:39,256][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:19:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:19:39,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:19:40,232][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:19:40,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:19:40,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:19:41,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:19:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:19:41,875][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:19:42,200][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:19:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:19:42,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:19:43,547][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:19:44,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:19:44,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:19:44,283][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:19:45,155][__main__][INFO] - Iteration 498 took 23s (40.06% Gen, 56.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 45s. Estimated total time: 19h 31m 41s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 16s. [2025-11-13 11:19:45,158][__main__][INFO] - Starting iteration 498. [2025-11-13 11:19:45,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:19:45,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:19:54,715][__main__][INFO] - Number of regex retries in iteration 498: 0 [2025-11-13 11:19:54,715][__main__][INFO] - agents played in iteration 498 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:19:55,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:55,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:55,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:55,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:19:55,277][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:19:55,277][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:19:56,003][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:19:56,301][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:19:56,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:19:56,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:19:57,279][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:19:57,606][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:19:57,932][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:19:58,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:19:58,584][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:19:58,916][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:19:59,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:19:59,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:19:59,894][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:20:00,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:20:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:20:00,875][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:20:01,201][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:20:01,530][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:20:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:20:02,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:20:02,505][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:20:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:20:03,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:20:03,479][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:20:03,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:20:04,131][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:20:04,456][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:20:04,781][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:20:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:20:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:20:05,764][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:20:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:20:06,420][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:20:07,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:20:07,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:20:07,872][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:20:07,873][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:20:08,778][__main__][INFO] - Iteration 499 took 23s (40.45% Gen, 55.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 27m 35s. Estimated total time: 19h 40m 54s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s. [2025-11-13 11:20:08,780][__main__][INFO] - Starting iteration 499. [2025-11-13 11:20:08,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:20:08,784][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:20:18,036][__main__][INFO] - Number of regex retries in iteration 499: 0 [2025-11-13 11:20:18,037][__main__][INFO] - agents played in iteration 499 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:20:18,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:18,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:18,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:18,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:18,597][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:20:18,597][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:20:19,327][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:20:19,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:20:19,953][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:20:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:20:20,609][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:20:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:20:21,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:20:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:20:21,919][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:20:22,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:20:22,576][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:20:22,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:20:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:20:23,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:20:23,887][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:20:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:20:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:20:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:20:25,206][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:20:25,534][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:20:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:20:26,185][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:20:26,510][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:20:26,836][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:20:27,164][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:20:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:20:27,817][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:20:28,142][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:20:28,469][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:20:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:20:29,119][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:20:29,445][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:20:29,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:20:30,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:20:31,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:20:31,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:20:31,202][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:20:32,099][__main__][INFO] - Iteration 500 took 23s (39.68% Gen, 56.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 12m 6s. Estimated total time: 19h 25m 49s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 18s. [2025-11-13 11:20:32,102][__main__][INFO] - Starting iteration 500. [2025-11-13 11:20:32,105][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. [2025-11-13 11:20:32,105][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:20:40,752][__main__][INFO] - Number of regex retries in iteration 500: 0 [2025-11-13 11:20:40,753][__main__][INFO] - agents played in iteration 500 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:20:41,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:41,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:41,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:41,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:41,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:20:41,312][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:20:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:20:42,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:20:42,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:20:43,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:20:43,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:20:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:20:43,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:20:44,304][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:20:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:20:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:20:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:20:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:20:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:20:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:20:46,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:20:46,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:20:47,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:20:47,578][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:20:47,904][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:20:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:20:48,557][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:20:48,882][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:20:49,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:20:49,534][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:20:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:20:50,183][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:20:50,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:20:50,834][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:20:51,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:20:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:20:51,817][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:20:52,142][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:20:52,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:20:53,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:20:53,859][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:20:53,861][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:20:53,862][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:20:55,636][__main__][INFO] - Iteration 501 took 23s (36.74% Gen, 55.71% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 22m 31s. Estimated total time: 19h 36m 37s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 13s, 500 more iterations: 3h 16m 6s. [2025-11-13 11:20:55,640][__main__][INFO] - Starting iteration 501. [2025-11-13 11:20:55,643][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:20:55,644][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:21:04,413][__main__][INFO] - Number of regex retries in iteration 501: 0 [2025-11-13 11:21:04,414][__main__][INFO] - agents played in iteration 501 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:21:04,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:04,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:04,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:04,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:04,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:21:04,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:21:05,705][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:21:06,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:21:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:21:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:21:06,984][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:21:07,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:21:07,634][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:21:07,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:21:08,285][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:21:08,610][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:21:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:21:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:21:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:21:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:21:10,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:21:10,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:21:10,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:21:11,232][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:21:11,559][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:21:11,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:21:12,210][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:21:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:21:12,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:21:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:21:13,519][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:21:13,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:21:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:21:14,496][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:21:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:21:15,151][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:21:15,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:21:15,801][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:21:16,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:21:16,824][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:21:17,570][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:21:17,572][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:21:17,573][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:21:18,548][__main__][INFO] - Iteration 502 took 22s (38.28% Gen, 57.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 50m 48s. Estimated total time: 19h 5m 17s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 52s. [2025-11-13 11:21:18,553][__main__][INFO] - Starting iteration 502. [2025-11-13 11:21:18,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:21:18,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:21:27,550][__main__][INFO] - Number of regex retries in iteration 502: 0 [2025-11-13 11:21:27,550][__main__][INFO] - agents played in iteration 502 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:21:27,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:28,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:28,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:28,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:28,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:21:28,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:21:28,843][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:21:29,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:21:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:21:29,800][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:21:30,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:21:30,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:21:30,779][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:21:31,107][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:21:31,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:21:31,761][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:21:32,086][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:21:32,412][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:21:32,741][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:21:33,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:21:33,394][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:21:33,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:21:34,047][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:21:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:21:34,702][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:21:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:21:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:21:35,689][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:21:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:21:36,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:21:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:21:36,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:21:37,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:21:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:21:37,976][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:21:38,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:21:38,627][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:21:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:21:39,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:21:39,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:21:40,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:21:40,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:21:40,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:21:41,596][__main__][INFO] - Iteration 503 took 23s (39.03% Gen, 57.03% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 57m 7s. Estimated total time: 19h 11m 59s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 59s. [2025-11-13 11:21:41,598][__main__][INFO] - Starting iteration 503. [2025-11-13 11:21:41,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:21:41,602][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:21:50,665][__main__][INFO] - Number of regex retries in iteration 503: 0 [2025-11-13 11:21:50,666][__main__][INFO] - agents played in iteration 503 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:21:51,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:51,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:51,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:51,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:21:51,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:21:51,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:21:51,939][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:21:52,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:21:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:21:52,895][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:21:53,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:21:53,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:21:53,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:21:54,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:21:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:21:54,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:21:55,187][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:21:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:21:55,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:21:56,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:21:56,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:21:56,822][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:21:57,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:21:57,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:21:57,802][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:21:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:21:58,468][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:21:58,796][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:21:59,124][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:21:59,453][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:21:59,783][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:22:00,116][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:22:00,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:22:00,773][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:22:01,101][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:22:01,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:22:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:22:02,085][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:22:02,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:22:03,082][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:22:03,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:22:03,835][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:22:03,837][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:22:04,849][__main__][INFO] - Iteration 504 took 23s (38.99% Gen, 56.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 9s. Estimated total time: 19h 22m 25s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 44s. [2025-11-13 11:22:04,851][__main__][INFO] - Starting iteration 504. [2025-11-13 11:22:04,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:22:04,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:22:13,785][__main__][INFO] - Number of regex retries in iteration 504: 0 [2025-11-13 11:22:13,786][__main__][INFO] - agents played in iteration 504 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:22:14,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:14,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:14,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:14,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:14,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:22:14,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:22:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:22:15,357][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:22:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:22:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:22:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:22:16,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:22:16,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:22:17,317][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:22:17,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:22:17,971][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:22:18,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:22:18,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:22:18,950][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:22:19,276][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:22:19,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:22:19,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:22:20,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:22:20,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:22:20,913][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:22:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:22:21,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:22:21,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:22:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:22:22,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:22:22,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:22:23,204][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:22:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:22:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:22:24,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:22:24,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:22:24,842][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:22:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:22:25,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:22:26,150][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:22:26,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:22:26,894][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:22:26,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:22:27,830][__main__][INFO] - Iteration 505 took 22s (38.87% Gen, 57.05% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 53m 12s. Estimated total time: 19h 8m 51s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 28s. [2025-11-13 11:22:27,833][__main__][INFO] - Starting iteration 505. [2025-11-13 11:22:27,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:22:27,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:22:36,222][__main__][INFO] - Number of regex retries in iteration 505: 0 [2025-11-13 11:22:36,223][__main__][INFO] - agents played in iteration 505 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:22:36,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:36,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:36,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:36,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:36,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:22:36,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:22:37,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:22:37,807][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:22:38,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:22:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:22:38,788][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:22:39,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:22:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:22:39,768][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:22:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:22:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:22:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:22:41,077][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:22:41,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:22:41,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:22:42,064][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:22:42,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:22:42,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:22:43,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:22:43,373][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:22:43,702][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:22:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:22:44,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:22:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:22:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:22:45,334][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:22:45,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:22:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:22:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:22:46,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:22:46,975][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:22:47,300][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:22:47,625][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:22:47,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:22:48,608][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:22:49,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:22:49,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:22:49,339][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:22:50,271][__main__][INFO] - Iteration 506 took 22s (37.38% Gen, 58.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 46s. Estimated total time: 18h 41m 48s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 23s, 500 more iterations: 3h 6m 58s. [2025-11-13 11:22:50,273][__main__][INFO] - Starting iteration 506. [2025-11-13 11:22:50,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:22:50,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:22:59,161][__main__][INFO] - Number of regex retries in iteration 506: 0 [2025-11-13 11:22:59,162][__main__][INFO] - agents played in iteration 506 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:22:59,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:59,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:59,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:59,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:22:59,721][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:22:59,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:23:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:23:00,739][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:23:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:23:01,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:23:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:23:02,048][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:23:02,373][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:23:02,700][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:23:03,027][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:23:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:23:03,681][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:23:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:23:04,336][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:23:04,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:23:04,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:23:05,322][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:23:05,651][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:23:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:23:06,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:23:06,641][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:23:06,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:23:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:23:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:23:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:23:08,289][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:23:08,622][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:23:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:23:09,277][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:23:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:23:09,935][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:23:10,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:23:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:23:10,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:23:11,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:23:12,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:23:12,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:23:12,304][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:23:13,258][__main__][INFO] - Iteration 507 took 22s (38.66% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 52m 44s. Estimated total time: 19h 9m 8s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 31s. [2025-11-13 11:23:13,260][__main__][INFO] - Starting iteration 507. [2025-11-13 11:23:13,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:23:13,264][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:23:21,834][__main__][INFO] - Number of regex retries in iteration 507: 0 [2025-11-13 11:23:21,835][__main__][INFO] - agents played in iteration 507 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:23:22,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:22,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:22,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:22,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:22,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:23:22,396][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:23:23,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:23:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:23:23,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:23:24,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:23:24,398][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:23:24,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:23:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:23:25,378][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:23:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:23:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:23:26,360][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:23:26,686][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:23:27,013][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:23:27,339][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:23:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:23:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:23:28,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:23:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:23:28,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:23:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:23:29,633][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:23:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:23:30,287][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:23:30,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:23:30,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:23:31,269][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:23:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:23:31,931][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:23:32,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:23:32,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:23:32,918][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:23:33,247][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:23:33,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:23:34,230][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:23:34,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:23:34,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:23:34,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:23:35,898][__main__][INFO] - Iteration 508 took 22s (37.86% Gen, 58.01% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 34m 58s. Estimated total time: 18h 51m 45s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 37s. [2025-11-13 11:23:35,900][__main__][INFO] - Starting iteration 508. [2025-11-13 11:23:35,904][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:23:35,904][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:23:44,948][__main__][INFO] - Number of regex retries in iteration 508: 0 [2025-11-13 11:23:44,948][__main__][INFO] - agents played in iteration 508 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:23:45,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:45,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:45,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:45,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:45,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:23:45,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:23:46,239][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:23:46,536][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:23:46,863][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:23:47,189][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:23:47,516][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:23:47,841][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:23:48,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:23:48,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:23:48,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:23:49,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:23:49,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:23:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:23:50,136][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:23:50,461][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:23:50,788][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:23:51,117][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:23:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:23:51,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:23:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:23:52,430][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:23:52,755][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:23:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:23:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:23:53,736][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:23:54,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:23:54,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:23:54,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:23:55,047][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:23:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:23:55,701][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:23:56,028][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:23:56,359][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:23:56,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:23:57,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:23:58,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:23:58,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:23:58,082][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:23:58,957][__main__][INFO] - Iteration 509 took 23s (39.23% Gen, 56.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 55m 33s. Estimated total time: 19h 12m 43s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 7s. [2025-11-13 11:23:58,960][__main__][INFO] - Starting iteration 509. [2025-11-13 11:23:58,963][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:23:58,964][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:24:07,513][__main__][INFO] - Number of regex retries in iteration 509: 0 [2025-11-13 11:24:07,514][__main__][INFO] - agents played in iteration 509 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:24:07,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:07,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:08,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:08,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:08,075][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:24:08,075][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:24:08,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:24:09,118][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:24:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:24:09,773][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:24:10,100][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:24:10,428][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:24:10,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:24:11,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:24:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:24:11,736][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:24:12,064][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:24:12,391][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:24:12,717][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:24:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:24:13,371][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:24:13,696][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:24:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:24:14,349][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:24:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:24:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:24:15,347][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:24:15,673][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:24:15,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:24:16,325][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:24:16,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:24:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:24:17,301][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:24:17,627][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:24:17,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:24:18,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:24:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:24:18,930][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:24:19,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:24:19,914][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:24:20,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:24:20,650][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:24:20,652][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:24:21,547][__main__][INFO] - Iteration 510 took 22s (37.86% Gen, 58.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 31m 42s. Estimated total time: 18h 49m 15s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 12s. [2025-11-13 11:24:21,549][__main__][INFO] - Starting iteration 510. [2025-11-13 11:24:21,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. [2025-11-13 11:24:21,554][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:24:30,397][__main__][INFO] - Number of regex retries in iteration 510: 0 [2025-11-13 11:24:30,397][__main__][INFO] - agents played in iteration 510 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:24:30,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:30,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:30,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:30,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:30,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:24:30,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:24:31,680][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:24:31,978][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:24:32,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:24:32,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:24:32,959][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:24:33,286][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:24:33,613][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:24:33,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:24:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:24:34,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:24:34,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:24:35,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:24:35,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:24:35,903][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:24:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:24:36,563][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:24:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:24:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:24:37,549][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:24:37,878][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:24:38,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:24:38,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:24:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:24:39,201][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:24:39,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:24:39,860][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:24:40,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:24:40,520][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:24:40,848][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:24:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:24:41,515][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:24:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:24:42,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:24:42,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:24:43,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:24:43,566][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:24:43,568][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:24:45,333][__main__][INFO] - Iteration 511 took 23s (37.19% Gen, 55.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 31m 7s. Estimated total time: 19h 49m 3s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 10s. [2025-11-13 11:24:45,335][__main__][INFO] - Starting iteration 511. [2025-11-13 11:24:45,338][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:24:45,339][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:24:54,829][__main__][INFO] - Number of regex retries in iteration 511: 0 [2025-11-13 11:24:54,830][__main__][INFO] - agents played in iteration 511 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:24:55,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:24:55,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:24:56,125][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:24:56,423][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:24:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:24:57,077][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:24:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:24:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:24:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:24:58,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:24:58,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:24:59,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:24:59,365][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:24:59,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:25:00,020][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:00,674][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:01,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:01,327][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:01,652][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:01,980][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:02,305][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:25:02,632][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:25:02,961][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:03,293][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:03,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:04,272][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:04,930][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:05,582][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:05,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:06,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:06,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:25:07,246][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:25:07,980][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:25:07,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:25:07,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:08,914][__main__][INFO] - Iteration 512 took 23s (40.25% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 20m 31s. Estimated total time: 19h 38m 51s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 28s. [2025-11-13 11:25:08,917][__main__][INFO] - Starting iteration 512. [2025-11-13 11:25:08,921][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:25:08,921][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:17,869][__main__][INFO] - Number of regex retries in iteration 512: 0 [2025-11-13 11:25:17,870][__main__][INFO] - agents played in iteration 512 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:25:18,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:18,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:18,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:18,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:18,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:18,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:25:19,168][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:25:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:25:19,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:25:20,122][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:25:20,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:25:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:25:21,105][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:25:21,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:25:21,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:25:22,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:25:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:25:22,739][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:25:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:23,393][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:23,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:24,045][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:24,371][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:25,354][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:25:25,682][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:25:26,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:26,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:26,987][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:27,313][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:27,641][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:27,967][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:28,619][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:29,270][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:29,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:25:30,247][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:25:30,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:25:30,985][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:25:30,987][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:31,934][__main__][INFO] - Iteration 513 took 23s (38.88% Gen, 56.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 52m 1s. Estimated total time: 19h 10m 44s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 47s. [2025-11-13 11:25:31,936][__main__][INFO] - Starting iteration 513. [2025-11-13 11:25:31,940][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:25:31,940][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:40,945][__main__][INFO] - Number of regex retries in iteration 513: 0 [2025-11-13 11:25:40,946][__main__][INFO] - agents played in iteration 513 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:25:41,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:41,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:41,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:41,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:41,507][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:41,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:25:42,240][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:25:42,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:25:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:25:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:25:43,524][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:25:43,851][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:25:44,184][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:25:44,512][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:25:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:25:45,166][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:25:45,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:25:45,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:25:46,151][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:46,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:46,806][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:47,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:48,114][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:48,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:25:48,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:25:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:49,421][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:50,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:50,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:50,729][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:51,055][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:51,382][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:51,710][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:52,037][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:52,365][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:52,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:25:53,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:25:54,094][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:25:54,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:25:54,098][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:55,001][__main__][INFO] - Iteration 514 took 23s (39.05% Gen, 57.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 54m 2s. Estimated total time: 19h 13m 8s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 11s. [2025-11-13 11:25:55,003][__main__][INFO] - Starting iteration 514. [2025-11-13 11:25:55,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:25:55,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:03,732][__main__][INFO] - Number of regex retries in iteration 514: 0 [2025-11-13 11:26:03,732][__main__][INFO] - agents played in iteration 514 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:26:04,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:04,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:04,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:04,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:04,292][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:04,292][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:26:05,023][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:26:05,320][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:26:05,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:26:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:26:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:26:06,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:26:06,954][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:26:07,280][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:26:07,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:26:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:26:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:26:08,591][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:26:08,917][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:26:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:26:09,575][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:26:09,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:26:10,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:26:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:26:10,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:26:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:26:11,544][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:26:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:26:12,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:26:12,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:26:12,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:26:13,175][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:26:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:26:13,834][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:26:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:26:14,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:26:14,814][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:26:15,140][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:26:15,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:26:16,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:26:16,877][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:26:16,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:26:16,880][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:26:17,791][__main__][INFO] - Iteration 515 took 22s (38.29% Gen, 57.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 39m 45s. Estimated total time: 18h 59m 13s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 52s. [2025-11-13 11:26:17,793][__main__][INFO] - Starting iteration 515. [2025-11-13 11:26:17,796][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:26:17,797][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:26,243][__main__][INFO] - Number of regex retries in iteration 515: 0 [2025-11-13 11:26:26,244][__main__][INFO] - agents played in iteration 515 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:26:26,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:26,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:26,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:26,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:26,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:26,808][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:26:27,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:26:27,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:26:28,163][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:26:28,489][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:26:28,815][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:26:29,142][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:26:29,469][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:26:29,796][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:26:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:26:30,450][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:26:30,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:26:31,107][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:26:31,434][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:26:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:26:32,095][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:26:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:26:32,749][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:26:33,076][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:26:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:26:33,730][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:26:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:26:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:26:34,713][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:26:35,046][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:26:35,380][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:26:35,710][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:26:36,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:26:36,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:26:36,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:26:37,035][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:26:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:26:37,688][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:26:38,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:26:38,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:26:39,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:26:39,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:26:39,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:26:40,307][__main__][INFO] - Iteration 516 took 22s (37.53% Gen, 58.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 44s. Estimated total time: 18h 45m 35s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 31s, 500 more iterations: 3h 7m 35s. [2025-11-13 11:26:40,309][__main__][INFO] - Starting iteration 516. [2025-11-13 11:26:40,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:26:40,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:48,748][__main__][INFO] - Number of regex retries in iteration 516: 0 [2025-11-13 11:26:48,748][__main__][INFO] - agents played in iteration 516 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:26:49,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:49,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:49,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:49,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:49,309][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:49,309][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:26:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:26:50,334][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:26:50,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:26:50,997][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:26:51,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:26:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:26:51,977][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:26:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:26:52,631][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:26:52,958][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:26:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:26:53,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:26:53,940][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:26:54,268][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:26:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:26:54,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:26:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:26:55,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:26:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:26:56,231][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:26:56,561][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:26:56,888][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:26:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:26:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:26:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:26:58,199][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:26:58,525][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:26:58,858][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:26:59,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:26:59,512][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:26:59,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:27:00,169][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:27:00,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:27:01,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:27:01,894][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:27:01,896][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:27:01,898][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:27:02,785][__main__][INFO] - Iteration 517 took 22s (37.54% Gen, 58.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 26s. Estimated total time: 18h 43m 40s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 27s, 500 more iterations: 3h 7m 16s. [2025-11-13 11:27:02,787][__main__][INFO] - Starting iteration 517. [2025-11-13 11:27:02,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:27:02,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:27:10,823][__main__][INFO] - Number of regex retries in iteration 517: 0 [2025-11-13 11:27:10,824][__main__][INFO] - agents played in iteration 517 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:27:11,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:11,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:11,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:11,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:11,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:27:11,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:27:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:27:12,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:27:12,776][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:27:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:27:13,429][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:27:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:27:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:27:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:27:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:27:15,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:27:15,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:27:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:27:16,047][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:27:16,374][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:27:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:27:17,029][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:27:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:27:17,684][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:27:18,012][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:27:18,338][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:27:18,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:27:18,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:27:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:27:19,648][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:27:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:27:20,303][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:27:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:27:20,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:27:21,287][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:27:21,612][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:27:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:27:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:27:22,588][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:27:23,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:27:24,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:27:24,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:27:24,027][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:27:24,901][__main__][INFO] - Iteration 518 took 22s (36.33% Gen, 59.71% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 4m 58s. Estimated total time: 18h 25m 34s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 51s, 500 more iterations: 3h 4m 15s. [2025-11-13 11:27:24,903][__main__][INFO] - Starting iteration 518. [2025-11-13 11:27:24,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:27:24,907][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:27:33,397][__main__][INFO] - Number of regex retries in iteration 518: 0 [2025-11-13 11:27:33,398][__main__][INFO] - agents played in iteration 518 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:27:33,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:33,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:33,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:33,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:33,955][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:27:33,956][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:27:34,676][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:27:34,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:27:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:27:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:27:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:27:36,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:27:36,617][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:27:36,943][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:27:37,270][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:27:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:27:37,924][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:27:38,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:27:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:27:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:27:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:27:39,563][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:27:39,890][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:27:40,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:27:40,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:27:40,873][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:27:41,200][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:27:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:27:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:27:42,182][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:27:42,512][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:27:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:27:43,169][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:27:43,495][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:27:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:27:44,149][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:27:44,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:27:44,803][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:27:45,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:27:45,831][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:27:46,568][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:27:46,570][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:27:46,571][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:27:47,485][__main__][INFO] - Iteration 519 took 22s (37.60% Gen, 58.34% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 28m 1s. Estimated total time: 18h 49m 0s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 10s. [2025-11-13 11:27:47,488][__main__][INFO] - Starting iteration 519. [2025-11-13 11:27:47,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:27:47,491][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:27:55,616][__main__][INFO] - Number of regex retries in iteration 519: 0 [2025-11-13 11:27:55,617][__main__][INFO] - agents played in iteration 519 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:27:56,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:56,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:56,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:56,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:27:56,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:27:56,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:27:56,900][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:27:57,198][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:27:57,525][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:27:57,852][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:27:58,179][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:27:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:27:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:27:59,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:27:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:27:59,822][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:28:00,149][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:28:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:28:00,803][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:28:01,131][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:28:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:28:01,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:28:02,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:28:02,446][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:28:02,775][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:28:03,103][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:28:03,431][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:28:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:28:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:28:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:28:04,742][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:28:05,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:28:05,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:28:05,722][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:28:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:28:06,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:28:06,708][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:28:07,036][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:28:07,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:28:08,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:28:08,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:28:08,779][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:28:08,781][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:28:09,669][__main__][INFO] - Iteration 520 took 22s (36.64% Gen, 59.35% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 7m 37s. Estimated total time: 18h 28m 58s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 57s, 500 more iterations: 3h 4m 49s. [2025-11-13 11:28:09,671][__main__][INFO] - Starting iteration 520. [2025-11-13 11:28:09,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. [2025-11-13 11:28:09,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:28:18,411][__main__][INFO] - Number of regex retries in iteration 520: 0 [2025-11-13 11:28:18,412][__main__][INFO] - agents played in iteration 520 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:28:18,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:18,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:18,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:18,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:18,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:28:18,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:28:19,695][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:28:19,993][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:28:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:28:20,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:28:20,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:28:21,302][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:28:21,631][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:28:21,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:28:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:28:22,612][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:28:22,940][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:28:23,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:28:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:28:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:28:24,249][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:28:24,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:28:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:28:25,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:28:25,558][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:28:25,886][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:28:26,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:28:26,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:28:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:28:27,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:28:27,523][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:28:27,849][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:28:28,177][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:28:28,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:28:28,830][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:28:29,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:28:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:28:29,814][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:28:30,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:28:30,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:28:31,557][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:28:31,558][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:28:31,560][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:28:33,357][__main__][INFO] - Iteration 521 took 23s (36.60% Gen, 55.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 22m 26s. Estimated total time: 19h 44m 10s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 21s. [2025-11-13 11:28:33,359][__main__][INFO] - Starting iteration 521. [2025-11-13 11:28:33,363][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:28:33,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:28:42,347][__main__][INFO] - Number of regex retries in iteration 521: 0 [2025-11-13 11:28:42,348][__main__][INFO] - agents played in iteration 521 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:28:42,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:42,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:42,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:42,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:28:42,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:28:42,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:28:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:28:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:28:44,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:28:44,585][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:28:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:28:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:28:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:28:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:28:46,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:28:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:28:46,887][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:28:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:28:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:28:47,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:28:48,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:28:48,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:28:48,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:28:49,176][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:28:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:28:49,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:28:50,159][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:28:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:28:50,813][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:28:51,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:28:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:28:51,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:28:52,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:28:52,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:28:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:28:53,112][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:28:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:28:53,765][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:28:54,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:28:54,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:28:55,480][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:28:55,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:28:55,483][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:28:56,359][__main__][INFO] - Iteration 522 took 22s (39.07% Gen, 57.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 47m 44s. Estimated total time: 19h 9m 52s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 38s. [2025-11-13 11:28:56,361][__main__][INFO] - Starting iteration 522. [2025-11-13 11:28:56,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:28:56,365][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:29:04,803][__main__][INFO] - Number of regex retries in iteration 522: 0 [2025-11-13 11:29:04,803][__main__][INFO] - agents played in iteration 522 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:29:05,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:05,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:05,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:05,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:05,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:29:05,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:29:06,100][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:29:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:29:06,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:29:07,053][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:29:07,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:29:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:29:08,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:29:08,363][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:29:08,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:29:09,018][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:29:09,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:29:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:29:10,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:29:10,328][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:29:10,655][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:29:10,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:29:11,310][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:29:11,636][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:29:11,964][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:29:12,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:29:12,618][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:29:12,945][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:29:13,273][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:29:13,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:29:13,926][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:29:14,253][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:29:14,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:29:14,912][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:29:15,241][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:29:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:29:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:29:16,226][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:29:16,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:29:17,234][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:29:17,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:29:17,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:29:17,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:29:18,938][__main__][INFO] - Iteration 523 took 22s (37.38% Gen, 58.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 26m 11s. Estimated total time: 18h 48m 41s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 37s, 500 more iterations: 3h 8m 6s. [2025-11-13 11:29:18,940][__main__][INFO] - Starting iteration 523. [2025-11-13 11:29:18,943][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:29:18,944][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:29:27,677][__main__][INFO] - Number of regex retries in iteration 523: 0 [2025-11-13 11:29:27,677][__main__][INFO] - agents played in iteration 523 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:29:28,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:28,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:28,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:28,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:28,238][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:29:28,239][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:29:28,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:29:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:29:29,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:29:29,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:29:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:29:30,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:29:30,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:29:31,232][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:29:31,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:29:31,888][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:29:32,216][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:29:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:29:32,870][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:29:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:29:33,524][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:29:33,851][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:29:34,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:29:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:29:34,835][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:29:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:29:35,495][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:29:35,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:29:36,149][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:29:36,475][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:29:36,802][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:29:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:29:37,458][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:29:37,785][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:29:38,112][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:29:38,439][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:29:38,767][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:29:39,094][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:29:39,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:29:40,103][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:29:40,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:29:40,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:29:40,839][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:29:41,733][__main__][INFO] - Iteration 524 took 22s (38.32% Gen, 57.75% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 36m 39s. Estimated total time: 18h 59m 31s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 59s, 500 more iterations: 3h 9m 55s. [2025-11-13 11:29:41,735][__main__][INFO] - Starting iteration 524. [2025-11-13 11:29:41,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:29:41,739][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:29:49,869][__main__][INFO] - Number of regex retries in iteration 524: 0 [2025-11-13 11:29:49,870][__main__][INFO] - agents played in iteration 524 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:29:50,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:50,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:50,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:50,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:50,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:29:50,433][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:29:51,164][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:29:51,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:29:51,791][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:29:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:29:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:29:52,773][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:29:53,100][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:29:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:29:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:29:54,086][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:29:54,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:29:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:29:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:29:55,405][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:29:55,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:29:56,065][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:29:56,395][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:29:56,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:29:57,050][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:29:57,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:29:57,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:29:58,032][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:29:58,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:29:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:29:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:29:59,345][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:29:59,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:30:00,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:30:00,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:30:00,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:30:00,985][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:30:01,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:30:01,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:30:02,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:30:03,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:30:03,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:30:03,073][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:30:03,953][__main__][INFO] - Iteration 525 took 22s (36.60% Gen, 59.43% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 7m 32s. Estimated total time: 18h 30m 47s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 1s, 500 more iterations: 3h 5m 7s. [2025-11-13 11:30:03,955][__main__][INFO] - Starting iteration 525. [2025-11-13 11:30:03,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:30:03,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:30:12,359][__main__][INFO] - Number of regex retries in iteration 525: 0 [2025-11-13 11:30:12,360][__main__][INFO] - agents played in iteration 525 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:30:12,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:12,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:12,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:12,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:12,921][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:30:12,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:30:13,650][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:30:13,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:30:14,276][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:30:14,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:30:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:30:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:30:15,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:30:15,912][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:30:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:30:16,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:30:16,895][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:30:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:30:17,554][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:30:17,881][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:30:18,208][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:30:18,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:30:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:30:19,195][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:30:19,529][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:30:19,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:30:20,182][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:30:20,510][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:30:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:30:21,166][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:30:21,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:30:21,819][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:30:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:30:22,475][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:30:22,802][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:30:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:30:23,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:30:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:30:24,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:30:24,821][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:30:25,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:30:25,566][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:30:25,568][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:30:26,440][__main__][INFO] - Iteration 526 took 22s (37.37% Gen, 58.75% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 20m 29s. Estimated total time: 18h 44m 6s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 28s, 500 more iterations: 3h 7m 21s. [2025-11-13 11:30:26,442][__main__][INFO] - Starting iteration 526. [2025-11-13 11:30:26,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:30:26,446][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:30:35,608][__main__][INFO] - Number of regex retries in iteration 526: 0 [2025-11-13 11:30:35,609][__main__][INFO] - agents played in iteration 526 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:30:36,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:36,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:36,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:36,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:36,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:30:36,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:30:36,904][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:30:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:30:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:30:37,860][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:30:38,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:30:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:30:38,842][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:30:39,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:30:39,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:30:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:30:40,151][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:30:40,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:30:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:30:41,136][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:30:41,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:30:41,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:30:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:30:42,449][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:30:42,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:30:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:30:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:30:43,761][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:30:44,090][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:30:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:30:44,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:30:45,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:30:45,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:30:45,729][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:30:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:30:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:30:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:30:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:30:47,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:30:48,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:30:48,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:30:48,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:30:48,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:30:49,680][__main__][INFO] - Iteration 527 took 23s (39.43% Gen, 56.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 57m 45s. Estimated total time: 19h 21m 45s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 37s. [2025-11-13 11:30:49,682][__main__][INFO] - Starting iteration 527. [2025-11-13 11:30:49,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:30:49,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:30:58,232][__main__][INFO] - Number of regex retries in iteration 527: 0 [2025-11-13 11:30:58,232][__main__][INFO] - agents played in iteration 527 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:30:58,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:58,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:58,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:58,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:30:58,789][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:30:58,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:30:59,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:30:59,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:31:00,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:31:00,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:31:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:31:01,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:31:01,461][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:31:01,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:31:02,116][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:31:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:31:02,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:31:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:31:03,429][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:31:03,757][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:31:04,086][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:31:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:31:04,742][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:31:05,069][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:31:05,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:31:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:31:06,055][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:31:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:31:06,709][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:31:07,036][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:31:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:31:07,693][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:31:08,020][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:31:08,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:31:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:31:09,003][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:31:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:31:09,659][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:31:09,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:31:10,696][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:31:11,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:31:11,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:31:11,412][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:31:12,287][__main__][INFO] - Iteration 528 took 22s (37.81% Gen, 58.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 44s. Estimated total time: 18h 50m 7s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 40s, 500 more iterations: 3h 8m 21s. [2025-11-13 11:31:12,290][__main__][INFO] - Starting iteration 528. [2025-11-13 11:31:12,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:31:12,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:31:21,108][__main__][INFO] - Number of regex retries in iteration 528: 0 [2025-11-13 11:31:21,109][__main__][INFO] - agents played in iteration 528 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:31:21,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:21,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:21,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:21,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:21,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:31:21,666][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:31:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:31:22,691][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:31:23,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:31:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:31:23,675][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:31:24,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:31:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:31:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:31:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:31:25,311][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:31:25,639][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:31:25,967][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:31:26,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:31:26,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:31:26,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:31:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:31:27,603][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:31:27,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:31:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:31:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:31:28,918][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:31:29,244][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:31:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:31:29,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:31:30,226][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:31:30,553][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:31:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:31:31,209][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:31:31,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:31:31,865][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:31:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:31:32,519][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:31:32,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:31:33,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:31:34,273][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:31:34,288][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:31:34,289][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:31:35,222][__main__][INFO] - Iteration 529 took 22s (38.44% Gen, 57.48% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 41m 42s. Estimated total time: 19h 6m 29s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 4s. [2025-11-13 11:31:35,224][__main__][INFO] - Starting iteration 529. [2025-11-13 11:31:35,227][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:31:35,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:31:43,782][__main__][INFO] - Number of regex retries in iteration 529: 0 [2025-11-13 11:31:43,783][__main__][INFO] - agents played in iteration 529 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:31:44,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:44,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:44,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:44,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:44,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:31:44,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:31:45,077][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:31:45,375][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:31:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:31:46,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:31:46,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:31:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:31:47,017][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:31:47,345][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:31:47,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:31:47,998][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:31:48,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:31:48,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:31:48,983][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:31:49,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:31:49,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:31:49,967][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:31:50,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:31:50,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:31:50,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:31:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:31:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:31:51,927][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:31:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:31:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:31:52,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:31:53,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:31:53,571][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:31:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:31:54,225][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:31:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:31:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:31:55,207][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:31:55,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:31:56,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:31:56,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:31:56,959][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:31:56,961][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:31:57,829][__main__][INFO] - Iteration 530 took 22s (37.86% Gen, 58.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 24m 59s. Estimated total time: 18h 50m 8s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 40s, 500 more iterations: 3h 8m 21s. [2025-11-13 11:31:57,831][__main__][INFO] - Starting iteration 530. [2025-11-13 11:31:57,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. [2025-11-13 11:31:57,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:05,886][__main__][INFO] - Number of regex retries in iteration 530: 0 [2025-11-13 11:32:05,886][__main__][INFO] - agents played in iteration 530 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:32:06,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:06,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:06,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:06,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:06,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:06,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:32:07,177][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:07,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:08,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:08,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:09,112][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:09,445][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:09,773][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:10,427][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:11,080][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:11,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:11,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:12,062][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:12,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:13,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:13,378][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:13,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:32:14,034][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:14,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:14,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:15,017][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:15,346][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:15,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:16,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:17,316][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:17,644][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:32:18,337][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:19,050][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:19,052][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:19,053][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:20,784][__main__][INFO] - Iteration 531 took 22s (35.08% Gen, 57.37% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 42m 1s. Estimated total time: 19h 7m 32s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 15s. [2025-11-13 11:32:20,787][__main__][INFO] - Starting iteration 531. [2025-11-13 11:32:20,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:32:20,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:29,183][__main__][INFO] - Number of regex retries in iteration 531: 0 [2025-11-13 11:32:29,184][__main__][INFO] - agents played in iteration 531 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:32:29,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:29,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:29,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:29,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:29,734][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:29,735][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:32:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:30,744][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:31,071][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:31,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:32,055][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:33,040][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:33,369][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:33,698][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:34,025][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:34,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:35,005][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:35,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:35,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:36,311][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:36,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:32:37,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:37,623][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:37,951][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:38,279][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:38,608][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:39,266][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:39,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:39,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:40,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:32:41,606][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:42,320][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:42,321][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:42,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:43,157][__main__][INFO] - Iteration 532 took 22s (37.52% Gen, 58.74% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 12m 31s. Estimated total time: 18h 38m 25s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 16s, 500 more iterations: 3h 6m 24s. [2025-11-13 11:32:43,159][__main__][INFO] - Starting iteration 532. [2025-11-13 11:32:43,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:32:43,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:52,043][__main__][INFO] - Number of regex retries in iteration 532: 0 [2025-11-13 11:32:52,044][__main__][INFO] - agents played in iteration 532 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:32:52,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:52,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:52,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:52,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:52,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:52,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:32:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:53,624][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:53,953][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:54,609][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:54,936][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:55,264][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:55,590][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:55,917][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:56,244][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:56,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:57,224][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:57,552][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:57,878][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:58,861][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:59,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:59,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:33:00,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:00,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:00,837][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:01,165][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:01,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:02,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:03,133][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:03,460][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:03,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:33:04,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:05,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:05,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:05,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:06,071][__main__][INFO] - Iteration 533 took 22s (38.76% Gen, 57.47% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 39m 12s. Estimated total time: 19h 5m 29s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 54s. [2025-11-13 11:33:06,073][__main__][INFO] - Starting iteration 533. [2025-11-13 11:33:06,076][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:33:06,077][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:14,689][__main__][INFO] - Number of regex retries in iteration 533: 0 [2025-11-13 11:33:14,690][__main__][INFO] - agents played in iteration 533 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:33:15,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:15,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:15,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:15,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:15,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:15,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:33:15,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:16,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:17,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:17,569][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:17,895][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:18,876][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:19,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:19,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:19,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:20,511][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:21,168][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:22,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:33:22,808][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:23,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:24,132][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:24,459][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:25,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:26,104][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:26,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:33:27,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:27,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:27,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:27,868][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:28,707][__main__][INFO] - Iteration 534 took 22s (38.05% Gen, 58.23% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 24m 56s. Estimated total time: 18h 51m 35s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 35s. [2025-11-13 11:33:28,711][__main__][INFO] - Starting iteration 534. [2025-11-13 11:33:28,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:33:28,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:37,794][__main__][INFO] - Number of regex retries in iteration 534: 0 [2025-11-13 11:33:37,795][__main__][INFO] - agents played in iteration 534 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:33:38,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:38,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:38,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:38,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:38,355][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:38,355][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:33:39,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:39,378][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:39,707][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:40,034][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:40,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:41,021][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:41,348][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:41,676][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:42,003][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:42,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:42,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:42,982][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:43,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:43,637][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:43,964][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:44,618][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:44,946][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:45,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:33:45,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:46,261][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:46,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:47,256][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:47,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:48,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:49,231][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:49,558][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:33:50,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:50,995][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:50,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:50,999][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:51,869][__main__][INFO] - Iteration 535 took 23s (39.21% Gen, 57.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 50m 45s. Estimated total time: 19h 17m 48s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 58s. [2025-11-13 11:33:51,871][__main__][INFO] - Starting iteration 535. [2025-11-13 11:33:51,874][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:33:51,875][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:34:00,936][__main__][INFO] - Number of regex retries in iteration 535: 0 [2025-11-13 11:34:00,937][__main__][INFO] - agents played in iteration 535 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:34:01,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:01,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:01,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:01,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:01,504][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:34:01,504][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:34:02,233][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:34:02,532][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:34:02,860][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:34:03,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:34:03,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:34:03,847][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:34:04,175][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:34:04,502][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:34:04,828][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:34:05,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:34:05,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:34:05,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:34:06,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:34:06,464][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:34:06,792][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:34:07,120][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:34:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:34:07,774][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:34:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:34:08,430][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:34:08,756][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:34:09,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:34:09,408][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:34:09,737][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:34:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:34:10,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:34:10,719][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:34:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:34:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:34:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:34:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:34:12,357][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:34:12,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:34:13,405][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:34:14,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:34:14,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:34:14,121][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:34:14,939][__main__][INFO] - Iteration 536 took 23s (39.29% Gen, 57.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 52s. Estimated total time: 19h 13m 17s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 12s. [2025-11-13 11:34:14,941][__main__][INFO] - Starting iteration 536. [2025-11-13 11:34:14,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:34:14,944][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:34:23,457][__main__][INFO] - Number of regex retries in iteration 536: 0 [2025-11-13 11:34:23,458][__main__][INFO] - agents played in iteration 536 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:34:23,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:23,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:23,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:23,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:23,993][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:34:23,993][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:34:24,679][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:34:24,978][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:34:25,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:34:25,635][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:34:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:34:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:34:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:34:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:34:27,286][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:34:27,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:34:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:34:28,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:34:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:34:28,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:34:29,252][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:34:29,578][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:34:29,907][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:34:30,234][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:34:30,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:34:30,888][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:34:31,216][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:34:31,543][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:34:31,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:34:32,200][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:34:32,528][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:34:32,856][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:34:33,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:34:33,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:34:33,850][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:34:34,179][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:34:34,506][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:34:34,835][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:34:35,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:34:35,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:34:36,585][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:34:36,586][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:34:36,588][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:34:37,408][__main__][INFO] - Iteration 537 took 22s (37.89% Gen, 58.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 15m 26s. Estimated total time: 18h 43m 14s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 26s, 500 more iterations: 3h 7m 12s. [2025-11-13 11:34:37,410][__main__][INFO] - Starting iteration 537. [2025-11-13 11:34:37,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:34:37,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:34:46,172][__main__][INFO] - Number of regex retries in iteration 537: 0 [2025-11-13 11:34:46,173][__main__][INFO] - agents played in iteration 537 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:34:46,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:46,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:46,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:46,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:46,712][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:34:46,713][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:34:47,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:34:47,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:34:48,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:34:48,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:34:48,687][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:34:49,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:34:49,352][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:34:49,681][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:34:50,011][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:34:50,338][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:34:50,666][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:34:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:34:51,320][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:34:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:34:51,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:34:52,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:34:52,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:34:52,955][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:34:53,283][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:34:53,610][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:34:53,937][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:34:54,264][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:34:54,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:34:54,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:34:55,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:34:55,573][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:34:55,901][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:34:56,229][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:34:56,559][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:34:56,889][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:34:57,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:34:57,552][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:34:57,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:34:58,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:34:59,330][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:34:59,332][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:34:59,333][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:35:00,223][__main__][INFO] - Iteration 538 took 22s (38.40% Gen, 57.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 32m 23s. Estimated total time: 19h 0m 34s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 1s, 500 more iterations: 3h 10m 5s. [2025-11-13 11:35:00,225][__main__][INFO] - Starting iteration 538. [2025-11-13 11:35:00,228][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:35:00,229][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:35:08,381][__main__][INFO] - Number of regex retries in iteration 538: 0 [2025-11-13 11:35:08,382][__main__][INFO] - agents played in iteration 538 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:35:08,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:08,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:08,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:08,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:08,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:35:08,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:35:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:35:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:35:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:35:10,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:35:10,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:35:11,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:35:11,532][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:35:11,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:35:12,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:35:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:35:12,843][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:35:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:35:13,501][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:35:13,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:35:14,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:35:14,483][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:35:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:35:15,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:35:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:35:15,793][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:35:16,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:35:16,449][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:35:16,776][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:35:17,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:35:17,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:35:17,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:35:18,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:35:18,410][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:35:18,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:35:19,064][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:35:19,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:35:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:35:20,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:35:20,756][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:35:21,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:35:21,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:35:21,490][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:35:22,306][__main__][INFO] - Iteration 539 took 22s (36.93% Gen, 59.37% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 55m 24s. Estimated total time: 18h 23m 57s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 47s, 500 more iterations: 3h 3m 59s. [2025-11-13 11:35:22,309][__main__][INFO] - Starting iteration 539. [2025-11-13 11:35:22,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:35:22,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:35:31,223][__main__][INFO] - Number of regex retries in iteration 539: 0 [2025-11-13 11:35:31,224][__main__][INFO] - agents played in iteration 539 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:35:31,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:31,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:31,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:31,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:31,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:35:31,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:35:32,483][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:35:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:35:33,107][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:35:33,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:35:33,759][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:35:34,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:35:34,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:35:34,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:35:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:35:35,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:35:35,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:35:36,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:35:36,386][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:35:36,714][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:35:37,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:35:37,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:35:37,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:35:38,025][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:35:38,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:35:38,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:35:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:35:39,333][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:35:39,660][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:35:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:35:40,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:35:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:35:40,972][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:35:41,299][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:35:41,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:35:41,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:35:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:35:42,607][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:35:42,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:35:43,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:35:44,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:35:44,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:35:44,426][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:35:45,299][__main__][INFO] - Iteration 540 took 22s (38.76% Gen, 57.43% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 40m 28s. Estimated total time: 19h 9m 24s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 34s. [2025-11-13 11:35:45,302][__main__][INFO] - Starting iteration 540. [2025-11-13 11:35:45,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. [2025-11-13 11:35:45,306][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:35:54,062][__main__][INFO] - Number of regex retries in iteration 540: 0 [2025-11-13 11:35:54,062][__main__][INFO] - agents played in iteration 540 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:35:54,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:54,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:54,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:54,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:54,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:35:54,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:35:55,349][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:35:55,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:35:55,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:35:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:35:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:35:56,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:35:57,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:35:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:35:57,958][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:35:58,286][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:35:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:35:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:35:59,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:35:59,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:35:59,929][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:36:00,259][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:36:00,587][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:36:00,915][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:36:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:36:01,569][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:36:01,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:36:02,223][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:36:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:36:02,878][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:36:03,206][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:36:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:36:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:36:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:36:04,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:36:04,843][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:36:05,171][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:36:05,498][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:36:05,825][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:36:06,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:36:07,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:36:07,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:36:07,316][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:36:09,103][__main__][INFO] - Iteration 541 took 23s (36.79% Gen, 55.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 20m 35s. Estimated total time: 19h 49m 55s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 19s. [2025-11-13 11:36:09,105][__main__][INFO] - Starting iteration 541. [2025-11-13 11:36:09,108][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:36:09,109][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:36:17,980][__main__][INFO] - Number of regex retries in iteration 541: 0 [2025-11-13 11:36:17,981][__main__][INFO] - agents played in iteration 541 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:36:18,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:18,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:18,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:18,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:18,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:36:18,535][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:36:19,222][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:36:19,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:36:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:36:20,172][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:36:20,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:36:20,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:36:21,154][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:36:21,484][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:36:21,810][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:36:22,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:36:22,465][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:36:22,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:36:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:36:23,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:36:23,783][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:36:24,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:36:24,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:36:24,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:36:25,090][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:36:25,418][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:36:25,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:36:26,074][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:36:26,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:36:26,730][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:36:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:36:27,385][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:36:27,712][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:36:28,039][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:36:28,366][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:36:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:36:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:36:29,350][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:36:29,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:36:30,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:36:31,127][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:36:31,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:36:31,130][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:36:31,969][__main__][INFO] - Iteration 542 took 22s (38.81% Gen, 57.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 33m 23s. Estimated total time: 19h 3m 5s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 30s. [2025-11-13 11:36:31,971][__main__][INFO] - Starting iteration 542. [2025-11-13 11:36:31,974][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:36:31,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:36:40,726][__main__][INFO] - Number of regex retries in iteration 542: 0 [2025-11-13 11:36:40,727][__main__][INFO] - agents played in iteration 542 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:36:41,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:41,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:41,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:41,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:36:41,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:36:41,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:36:42,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:36:42,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:36:42,630][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:36:42,963][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:36:43,291][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:36:43,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:36:43,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:36:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:36:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:36:44,929][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:36:45,256][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:36:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:36:45,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:36:46,244][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:36:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:36:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:36:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:36:47,560][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:36:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:36:48,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:36:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:36:48,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:36:49,195][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:36:49,521][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:36:49,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:36:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:36:50,503][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:36:50,831][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:36:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:36:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:36:51,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:36:52,140][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:36:52,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:36:53,186][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:36:53,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:36:53,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:36:53,932][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:36:54,860][__main__][INFO] - Iteration 543 took 22s (38.24% Gen, 57.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 34m 14s. Estimated total time: 19h 4m 20s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 43s. [2025-11-13 11:36:54,862][__main__][INFO] - Starting iteration 543. [2025-11-13 11:36:54,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:36:54,866][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:37:03,593][__main__][INFO] - Number of regex retries in iteration 543: 0 [2025-11-13 11:37:03,594][__main__][INFO] - agents played in iteration 543 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:37:04,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:04,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:04,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:04,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:04,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:37:04,144][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:37:04,855][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:37:05,154][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:37:05,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:37:05,810][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:37:06,136][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:37:06,465][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:37:06,792][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:37:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:37:07,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:37:07,781][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:37:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:37:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:37:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:37:09,093][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:37:09,423][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:37:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:37:10,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:37:10,414][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:37:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:37:11,069][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:37:11,397][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:37:11,724][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:37:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:37:12,378][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:37:12,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:37:13,033][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:37:13,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:37:13,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:37:14,016][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:37:14,345][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:37:14,672][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:37:15,000][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:37:15,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:37:16,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:37:16,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:37:16,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:37:16,778][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:37:17,672][__main__][INFO] - Iteration 544 took 22s (38.26% Gen, 57.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 29m 53s. Estimated total time: 19h 0m 22s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 0s, 500 more iterations: 3h 10m 3s. [2025-11-13 11:37:17,674][__main__][INFO] - Starting iteration 544. [2025-11-13 11:37:17,678][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:37:17,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:37:26,776][__main__][INFO] - Number of regex retries in iteration 544: 0 [2025-11-13 11:37:26,776][__main__][INFO] - agents played in iteration 544 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:37:27,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:27,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:27,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:27,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:27,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:37:27,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:37:28,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:37:28,304][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:37:28,634][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:37:28,964][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:37:29,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:37:29,623][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:37:29,954][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:37:30,283][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:37:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:37:30,938][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:37:31,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:37:31,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:37:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:37:32,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:37:32,590][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:37:32,919][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:37:33,251][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:37:33,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:37:33,914][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:37:34,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:37:34,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:37:34,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:37:35,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:37:35,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:37:35,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:37:36,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:37:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:37:36,863][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:37:37,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:37:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:37:37,845][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:37:38,172][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:37:38,500][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:37:39,208][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:37:39,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:37:39,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:37:39,916][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:37:40,836][__main__][INFO] - Iteration 545 took 23s (39.28% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 47m 6s. Estimated total time: 19h 17m 57s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 59s. [2025-11-13 11:37:40,839][__main__][INFO] - Starting iteration 545. [2025-11-13 11:37:40,842][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:37:40,842][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:37:49,452][__main__][INFO] - Number of regex retries in iteration 545: 0 [2025-11-13 11:37:49,453][__main__][INFO] - agents played in iteration 545 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:37:49,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:49,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:49,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:49,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:37:49,997][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:37:49,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:37:50,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:37:51,006][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:37:51,337][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:37:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:37:51,994][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:37:52,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:37:52,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:37:52,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:37:53,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:37:53,634][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:37:53,963][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:37:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:37:54,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:37:54,947][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:37:55,280][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:37:55,608][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:37:55,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:37:56,268][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:37:56,595][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:37:56,922][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:37:57,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:37:57,577][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:37:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:37:58,231][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:37:58,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:37:58,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:37:59,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:37:59,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:37:59,869][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:38:00,196][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:38:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:38:00,853][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:38:01,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:38:01,901][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:38:02,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:38:02,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:38:02,600][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:38:03,478][__main__][INFO] - Iteration 546 took 22s (38.04% Gen, 58.08% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 20m 36s. Estimated total time: 18h 51m 51s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 38s. [2025-11-13 11:38:03,480][__main__][INFO] - Starting iteration 546. [2025-11-13 11:38:03,484][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:38:03,484][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:38:12,304][__main__][INFO] - Number of regex retries in iteration 546: 0 [2025-11-13 11:38:12,304][__main__][INFO] - agents played in iteration 546 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:38:12,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:12,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:12,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:12,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:12,848][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:38:12,849][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:38:13,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:38:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:38:14,180][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:38:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:38:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:38:15,170][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:38:15,500][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:38:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:38:16,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:38:16,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:38:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:38:17,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:38:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:38:17,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:38:18,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:38:18,472][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:38:18,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:38:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:38:19,461][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:38:19,795][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:38:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:38:20,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:38:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:38:21,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:38:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:38:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:38:22,091][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:38:22,419][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:38:22,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:38:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:38:23,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:38:23,729][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:38:24,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:38:24,770][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:38:25,477][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:38:25,479][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:38:25,480][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:38:26,326][__main__][INFO] - Iteration 547 took 22s (38.61% Gen, 57.68% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 30m 32s. Estimated total time: 19h 2m 9s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 4s, 500 more iterations: 3h 10m 21s. [2025-11-13 11:38:26,328][__main__][INFO] - Starting iteration 547. [2025-11-13 11:38:26,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:38:26,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:38:35,088][__main__][INFO] - Number of regex retries in iteration 547: 0 [2025-11-13 11:38:35,089][__main__][INFO] - agents played in iteration 547 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:38:35,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:35,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:35,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:35,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:35,646][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:38:35,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:38:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:38:36,635][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:38:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:38:37,291][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:38:37,621][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:38:37,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:38:38,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:38:38,608][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:38:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:38:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:38:39,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:38:39,927][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:38:40,254][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:38:40,581][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:38:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:38:41,237][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:38:41,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:38:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:38:42,223][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:38:42,553][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:38:42,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:38:43,208][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:38:43,539][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:38:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:38:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:38:44,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:38:44,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:38:45,178][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:38:45,506][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:38:45,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:38:46,161][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:38:46,490][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:38:46,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:38:47,522][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:38:48,291][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:38:48,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:38:48,295][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:38:49,141][__main__][INFO] - Iteration 548 took 22s (38.39% Gen, 57.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 28m 34s. Estimated total time: 19h 0m 34s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 1s, 500 more iterations: 3h 10m 5s. [2025-11-13 11:38:49,144][__main__][INFO] - Starting iteration 548. [2025-11-13 11:38:49,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:38:49,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:38:58,223][__main__][INFO] - Number of regex retries in iteration 548: 0 [2025-11-13 11:38:58,224][__main__][INFO] - agents played in iteration 548 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:38:58,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:58,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:58,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:58,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:58,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:38:58,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:38:59,479][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:38:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:00,103][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:00,429][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:00,758][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:01,086][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:01,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:39:01,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:39:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:39:02,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:39:02,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:39:03,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:39:03,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:39:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:39:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:39:04,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:39:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:39:05,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:39:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:39:05,668][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:39:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:39:06,323][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:39:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:39:06,987][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:39:07,319][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:39:07,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:39:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:39:08,301][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:39:08,628][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:39:08,955][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:39:09,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:39:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:39:09,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:39:10,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:39:11,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:39:11,351][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:39:11,352][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:39:12,180][__main__][INFO] - Iteration 549 took 23s (39.40% Gen, 56.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 39m 20s. Estimated total time: 19h 11m 43s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 57s. [2025-11-13 11:39:12,182][__main__][INFO] - Starting iteration 549. [2025-11-13 11:39:12,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:39:12,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:20,904][__main__][INFO] - Number of regex retries in iteration 549: 0 [2025-11-13 11:39:20,904][__main__][INFO] - agents played in iteration 549 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:39:21,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:21,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:21,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:21,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:21,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:21,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:39:22,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:22,440][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:23,093][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:23,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:23,748][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:24,076][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:39:24,403][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:39:24,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:39:25,059][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:39:25,386][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:39:25,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:39:26,042][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:39:26,368][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:39:26,698][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:39:27,028][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:39:27,358][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:39:27,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:39:28,020][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:39:28,347][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:39:28,675][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:39:29,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:39:29,337][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:39:29,667][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:39:29,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:39:30,329][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:39:30,658][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:39:30,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:39:31,313][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:39:31,640][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:39:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:39:32,296][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:39:32,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:39:33,349][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:39:34,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:39:34,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:39:34,057][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:39:34,916][__main__][INFO] - Iteration 550 took 22s (38.35% Gen, 57.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 51s. Estimated total time: 18h 56m 37s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 26s. [2025-11-13 11:39:34,918][__main__][INFO] - Starting iteration 550. [2025-11-13 11:39:34,921][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. [2025-11-13 11:39:34,922][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:44,105][__main__][INFO] - Number of regex retries in iteration 550: 0 [2025-11-13 11:39:44,106][__main__][INFO] - agents played in iteration 550 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:39:44,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:44,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:44,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:44,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:44,646][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:44,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:39:45,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:45,964][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:46,290][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:46,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:47,270][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:39:47,599][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:39:47,926][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:39:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:39:48,580][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:39:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:39:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:39:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:39:49,885][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:39:50,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:39:50,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:39:50,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:39:51,193][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:39:51,524][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:39:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:39:52,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:39:52,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:39:52,836][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:39:53,164][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:39:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:39:53,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:39:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:39:54,481][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:39:54,808][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:39:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:39:55,461][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:39:55,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:39:56,491][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:39:57,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:39:57,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:39:57,202][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:39:58,794][__main__][INFO] - Iteration 551 took 23s (38.47% Gen, 54.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 20m 30s. Estimated total time: 19h 53m 40s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 56s. [2025-11-13 11:39:58,796][__main__][INFO] - Starting iteration 551. [2025-11-13 11:39:58,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:39:58,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:40:07,395][__main__][INFO] - Number of regex retries in iteration 551: 0 [2025-11-13 11:40:07,395][__main__][INFO] - agents played in iteration 551 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:40:07,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:07,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:07,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:07,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:07,939][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:40:07,940][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:40:08,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:40:08,939][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:40:09,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:40:09,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:40:09,924][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:40:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:40:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:40:10,905][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:40:11,233][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:40:11,567][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:40:11,893][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:40:12,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:40:12,546][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:40:12,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:40:13,203][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:40:13,537][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:40:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:40:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:40:14,531][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:40:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:40:15,187][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:40:15,517][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:40:15,845][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:40:16,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:40:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:40:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:40:17,166][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:40:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:40:17,831][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:40:18,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:40:18,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:40:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:40:19,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:40:19,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:40:20,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:40:20,548][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:40:20,550][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:40:21,449][__main__][INFO] - Iteration 552 took 22s (37.95% Gen, 58.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 1s. Estimated total time: 18h 52m 33s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 45s. [2025-11-13 11:40:21,453][__main__][INFO] - Starting iteration 552. [2025-11-13 11:40:21,456][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:40:21,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:40:30,556][__main__][INFO] - Number of regex retries in iteration 552: 0 [2025-11-13 11:40:30,556][__main__][INFO] - agents played in iteration 552 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:40:30,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:31,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:31,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:31,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:31,101][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:40:31,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:40:31,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:40:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:40:32,428][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:40:32,754][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:40:33,082][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:40:33,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:40:33,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:40:34,069][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:40:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:40:34,724][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:40:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:40:35,381][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:40:35,708][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:40:36,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:40:36,368][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:40:36,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:40:37,022][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:40:37,348][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:40:37,675][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:40:38,001][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:40:38,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:40:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:40:38,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:40:39,312][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:40:39,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:40:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:40:40,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:40:40,622][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:40:40,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:40:41,281][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:40:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:40:41,935][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:40:42,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:40:42,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:40:43,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:40:43,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:40:43,675][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:40:44,495][__main__][INFO] - Iteration 553 took 23s (39.49% Gen, 56.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 3s. Estimated total time: 19h 11m 59s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 59s. [2025-11-13 11:40:44,497][__main__][INFO] - Starting iteration 553. [2025-11-13 11:40:44,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:40:44,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:40:54,169][__main__][INFO] - Number of regex retries in iteration 553: 0 [2025-11-13 11:40:54,169][__main__][INFO] - agents played in iteration 553 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:40:54,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:54,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:54,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:54,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:54,710][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:40:54,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:40:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:40:55,709][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:40:56,036][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:40:56,369][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:40:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:40:57,028][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:40:57,358][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:40:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:40:58,015][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:40:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:40:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:40:59,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:40:59,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:40:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:40:59,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:41:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:41:00,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:41:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:41:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:41:01,622][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:41:01,949][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:41:02,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:41:02,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:41:02,933][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:41:03,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:41:03,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:41:03,926][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:41:04,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:41:04,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:41:04,916][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:41:05,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:41:05,576][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:41:05,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:41:06,631][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:41:07,333][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:41:07,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:41:07,336][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:41:08,153][__main__][INFO] - Iteration 554 took 23s (40.88% Gen, 55.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 8m 23s. Estimated total time: 19h 42m 42s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 7s. [2025-11-13 11:41:08,163][__main__][INFO] - Starting iteration 554. [2025-11-13 11:41:08,166][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:41:08,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:41:17,782][__main__][INFO] - Number of regex retries in iteration 554: 0 [2025-11-13 11:41:17,782][__main__][INFO] - agents played in iteration 554 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:41:18,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:18,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:18,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:18,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:18,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:41:18,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:41:19,022][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:41:19,319][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:41:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:41:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:41:20,304][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:41:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:41:20,962][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:41:21,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:41:21,618][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:41:21,945][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:41:22,271][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:41:22,598][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:41:22,924][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:41:23,251][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:41:23,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:41:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:41:24,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:41:24,561][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:41:24,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:41:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:41:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:41:25,871][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:41:26,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:41:26,525][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:41:26,852][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:41:27,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:41:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:41:27,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:41:28,162][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:41:28,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:41:28,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:41:29,147][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:41:29,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:41:30,183][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:41:30,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:41:30,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:41:30,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:41:31,734][__main__][INFO] - Iteration 555 took 23s (40.80% Gen, 55.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 3m 43s. Estimated total time: 19h 38m 25s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 24s. [2025-11-13 11:41:31,736][__main__][INFO] - Starting iteration 555. [2025-11-13 11:41:31,739][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:41:31,739][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:41:40,934][__main__][INFO] - Number of regex retries in iteration 555: 0 [2025-11-13 11:41:40,934][__main__][INFO] - agents played in iteration 555 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:41:41,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:41,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:41,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:41,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:41,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:41:41,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:41:42,175][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:41:42,472][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:41:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:41:43,129][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:41:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:41:43,788][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:41:44,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:41:44,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:41:44,777][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:41:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:41:45,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:41:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:41:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:41:46,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:41:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:41:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:41:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:41:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:41:48,054][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:41:48,380][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:41:48,706][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:41:49,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:41:49,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:41:49,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:41:50,029][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:41:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:41:50,684][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:41:51,018][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:41:51,348][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:41:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:41:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:41:52,332][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:41:52,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:41:53,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:41:54,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:41:54,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:41:54,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:41:54,893][__main__][INFO] - Iteration 556 took 23s (39.71% Gen, 56.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 42m 39s. Estimated total time: 19h 17m 45s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 57s. [2025-11-13 11:41:54,895][__main__][INFO] - Starting iteration 556. [2025-11-13 11:41:54,898][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:41:54,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:42:04,445][__main__][INFO] - Number of regex retries in iteration 556: 0 [2025-11-13 11:42:04,446][__main__][INFO] - agents played in iteration 556 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:42:04,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:04,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:04,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:04,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:04,987][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:42:04,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:42:05,674][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:42:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:42:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:42:06,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:42:06,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:42:07,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:42:07,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:42:07,949][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:42:08,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:42:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:42:08,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:42:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:42:09,585][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:42:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:42:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:42:10,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:42:10,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:42:11,227][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:42:11,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:42:11,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:42:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:42:12,533][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:42:12,862][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:42:13,191][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:42:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:42:13,846][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:42:14,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:42:14,501][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:42:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:42:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:42:15,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:42:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:42:16,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:42:16,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:42:17,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:42:17,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:42:17,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:42:18,383][__main__][INFO] - Iteration 557 took 23s (40.65% Gen, 55.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 48s. Estimated total time: 19h 34m 17s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 42s. [2025-11-13 11:42:18,385][__main__][INFO] - Starting iteration 557. [2025-11-13 11:42:18,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:42:18,389][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:42:28,163][__main__][INFO] - Number of regex retries in iteration 557: 0 [2025-11-13 11:42:28,164][__main__][INFO] - agents played in iteration 557 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:42:28,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:28,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:28,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:28,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:28,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:42:28,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:42:29,399][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:42:29,696][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:42:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:42:30,349][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:42:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:42:31,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:42:31,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:42:31,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:42:31,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:42:32,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:42:32,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:42:32,967][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:42:33,294][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:42:33,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:42:33,947][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:42:34,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:42:34,600][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:42:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:42:35,253][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:42:35,580][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:42:35,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:42:36,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:42:36,560][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:42:36,887][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:42:37,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:42:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:42:37,883][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:42:38,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:42:38,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:42:38,873][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:42:39,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:42:39,531][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:42:39,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:42:40,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:42:41,263][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:42:41,265][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:42:41,267][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:42:42,083][__main__][INFO] - Iteration 558 took 23s (41.25% Gen, 55.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 8m 53s. Estimated total time: 19h 44m 46s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 27s. [2025-11-13 11:42:42,087][__main__][INFO] - Starting iteration 558. [2025-11-13 11:42:42,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:42:42,090][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:42:51,734][__main__][INFO] - Number of regex retries in iteration 558: 0 [2025-11-13 11:42:51,734][__main__][INFO] - agents played in iteration 558 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:42:52,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:52,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:52,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:52,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:52,289][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:42:52,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:42:52,984][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:42:53,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:42:53,610][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:42:53,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:42:54,265][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:42:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:42:54,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:42:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:42:55,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:42:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:42:56,231][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:42:56,560][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:42:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:42:57,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:42:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:42:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:42:58,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:42:58,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:42:58,868][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:42:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:42:59,527][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:42:59,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:43:00,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:43:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:43:00,839][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:43:01,168][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:43:01,496][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:43:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:43:02,159][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:43:02,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:43:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:43:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:43:03,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:43:04,178][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:43:04,906][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:43:04,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:43:04,909][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:05,723][__main__][INFO] - Iteration 559 took 23s (40.80% Gen, 55.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 24s. Estimated total time: 19h 41m 41s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 56s. [2025-11-13 11:43:05,725][__main__][INFO] - Starting iteration 559. [2025-11-13 11:43:05,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:43:05,728][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:15,650][__main__][INFO] - Number of regex retries in iteration 559: 0 [2025-11-13 11:43:15,651][__main__][INFO] - agents played in iteration 559 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:43:16,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:16,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:16,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:16,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:16,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:16,192][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:43:16,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:43:17,184][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:43:17,512][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:43:17,840][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:43:18,169][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:43:18,496][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:43:18,823][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:43:19,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:43:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:43:19,802][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:43:20,129][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:43:20,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:43:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:43:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:43:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:43:21,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:43:22,093][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:43:22,420][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:43:22,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:43:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:43:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:43:23,727][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:43:24,056][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:43:24,383][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:43:24,712][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:43:25,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:43:25,368][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:43:25,695][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:43:26,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:43:26,353][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:43:26,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:43:27,009][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:43:27,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:43:28,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:43:28,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:43:28,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:43:28,752][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:29,573][__main__][INFO] - Iteration 560 took 23s (41.61% Gen, 54.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 15m 39s. Estimated total time: 19h 52m 20s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 44s, 500 more iterations: 3h 18m 43s. [2025-11-13 11:43:29,575][__main__][INFO] - Starting iteration 560. [2025-11-13 11:43:29,578][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. [2025-11-13 11:43:29,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:38,736][__main__][INFO] - Number of regex retries in iteration 560: 0 [2025-11-13 11:43:38,737][__main__][INFO] - agents played in iteration 560 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:43:39,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:39,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:39,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:39,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:39,276][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:39,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:43:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:43:40,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:43:40,598][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:43:40,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:43:41,265][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:43:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:43:41,920][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:43:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:43:42,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:43:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:43:43,238][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:43:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:43:43,898][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:43:44,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:43:44,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:43:44,877][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:43:45,204][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:43:45,530][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:43:45,859][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:43:46,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:43:46,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:43:46,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:43:47,176][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:43:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:43:47,829][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:43:48,156][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:43:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:43:48,814][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:43:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:43:49,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:43:49,796][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:43:50,124][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:43:50,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:43:51,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:43:51,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:43:51,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:43:51,871][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:53,441][__main__][INFO] - Iteration 561 took 23s (38.37% Gen, 55.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 16m 7s. Estimated total time: 19h 53m 12s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 52s. [2025-11-13 11:43:53,443][__main__][INFO] - Starting iteration 561. [2025-11-13 11:43:53,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:43:53,447][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:44:02,889][__main__][INFO] - Number of regex retries in iteration 561: 0 [2025-11-13 11:44:02,890][__main__][INFO] - agents played in iteration 561 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:44:03,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:03,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:03,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:03,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:03,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:44:03,433][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:44:04,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:44:04,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:44:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:44:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:44:05,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:44:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:44:06,053][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:44:06,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:44:06,709][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:44:07,041][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:44:07,368][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:44:07,697][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:44:08,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:44:08,356][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:44:08,683][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:44:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:44:09,338][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:44:09,664][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:44:09,993][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:44:10,319][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:44:10,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:44:10,972][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:44:11,299][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:44:11,627][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:44:11,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:44:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:44:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:44:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:44:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:44:13,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:44:13,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:44:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:44:14,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:44:15,311][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:44:16,000][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:44:16,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:44:16,003][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:44:16,808][__main__][INFO] - Iteration 562 took 23s (40.42% Gen, 56.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 50m 41s. Estimated total time: 19h 28m 9s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 41s. [2025-11-13 11:44:16,810][__main__][INFO] - Starting iteration 562. [2025-11-13 11:44:16,814][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:44:16,814][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:44:26,771][__main__][INFO] - Number of regex retries in iteration 562: 0 [2025-11-13 11:44:26,772][__main__][INFO] - agents played in iteration 562 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:44:27,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:27,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:27,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:27,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:27,308][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:44:27,308][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:44:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:44:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:44:28,658][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:44:28,987][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:44:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:44:29,642][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:44:29,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:44:30,296][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:44:30,622][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:44:30,948][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:44:31,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:44:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:44:31,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:44:32,258][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:44:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:44:32,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:44:33,237][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:44:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:44:33,890][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:44:34,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:44:34,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:44:34,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:44:35,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:44:35,523][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:44:35,851][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:44:36,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:44:36,520][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:44:36,848][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:44:37,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:44:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:44:37,832][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:44:38,160][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:44:38,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:44:39,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:44:39,899][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:44:39,901][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:44:39,902][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:44:40,737][__main__][INFO] - Iteration 563 took 23s (41.62% Gen, 54.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 22s. Estimated total time: 19h 56m 14s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 22s. [2025-11-13 11:44:40,740][__main__][INFO] - Starting iteration 563. [2025-11-13 11:44:40,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:44:40,743][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:44:50,579][__main__][INFO] - Number of regex retries in iteration 563: 0 [2025-11-13 11:44:50,579][__main__][INFO] - agents played in iteration 563 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:44:51,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:51,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:51,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:51,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:51,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:44:51,129][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:44:51,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:44:52,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:44:52,463][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:44:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:44:53,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:44:53,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:44:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:44:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:44:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:44:54,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:44:55,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:44:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:44:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:44:56,065][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:44:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:44:56,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:44:57,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:44:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:44:57,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:44:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:44:58,360][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:44:58,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:44:59,016][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:44:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:44:59,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:44:59,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:45:00,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:45:00,653][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:45:00,979][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:45:01,307][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:45:01,641][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:45:01,972][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:45:02,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:45:03,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:45:03,704][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:45:03,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:45:03,707][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:45:04,525][__main__][INFO] - Iteration 564 took 23s (41.36% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 10m 55s. Estimated total time: 19h 49m 10s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 11s. [2025-11-13 11:45:04,526][__main__][INFO] - Starting iteration 564. [2025-11-13 11:45:04,530][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:45:04,530][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:45:14,412][__main__][INFO] - Number of regex retries in iteration 564: 0 [2025-11-13 11:45:14,413][__main__][INFO] - agents played in iteration 564 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:45:14,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:14,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:14,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:14,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:14,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:45:14,967][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:45:15,662][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:45:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:45:16,287][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:45:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:45:16,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:45:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:45:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:45:17,938][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:45:18,265][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:45:18,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:45:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:45:19,257][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:45:19,586][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:45:19,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:45:20,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:45:20,581][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:45:20,914][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:45:21,247][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:45:21,579][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:45:21,914][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:45:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:45:22,569][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:45:22,896][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:45:23,227][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:45:23,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:45:23,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:45:24,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:45:24,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:45:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:45:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:45:25,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:45:25,878][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:45:26,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:45:26,908][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:45:27,611][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:45:27,613][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:45:27,615][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:45:28,425][__main__][INFO] - Iteration 565 took 23s (41.35% Gen, 55.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 16m 11s. Estimated total time: 19h 54m 51s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 8s. [2025-11-13 11:45:28,428][__main__][INFO] - Starting iteration 565. [2025-11-13 11:45:28,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:45:28,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:45:38,078][__main__][INFO] - Number of regex retries in iteration 565: 0 [2025-11-13 11:45:38,079][__main__][INFO] - agents played in iteration 565 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:45:38,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:38,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:38,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:38,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:45:38,621][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:45:38,621][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:45:39,319][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:45:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:45:39,942][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:45:40,272][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:45:40,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:45:40,925][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:45:41,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:45:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:45:41,913][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:45:42,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:45:42,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:45:42,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:45:43,222][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:45:43,549][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:45:43,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:45:44,203][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:45:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:45:44,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:45:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:45:45,510][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:45:45,837][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:45:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:45:46,491][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:45:46,818][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:45:47,145][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:45:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:45:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:45:48,127][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:45:48,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:45:48,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:45:49,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:45:49,440][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:45:49,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:45:50,472][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:45:51,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:45:51,168][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:45:51,169][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:45:52,008][__main__][INFO] - Iteration 566 took 23s (40.92% Gen, 55.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 59m 50s. Estimated total time: 19h 38m 53s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 28s. [2025-11-13 11:45:52,009][__main__][INFO] - Starting iteration 566. [2025-11-13 11:45:52,013][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:45:52,013][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:46:01,396][__main__][INFO] - Number of regex retries in iteration 566: 0 [2025-11-13 11:46:01,397][__main__][INFO] - agents played in iteration 566 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:46:01,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:01,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:01,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:01,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:01,955][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:46:01,955][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:46:02,660][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:46:02,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:46:03,286][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:46:03,616][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:46:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:46:04,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:46:04,598][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:46:04,926][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:46:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:46:05,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:46:05,910][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:46:06,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:46:06,563][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:46:06,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:46:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:46:07,544][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:46:07,870][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:46:08,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:46:08,522][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:46:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:46:09,177][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:46:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:46:09,832][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:46:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:46:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:46:10,828][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:46:11,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:46:11,488][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:46:11,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:46:12,147][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:46:12,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:46:12,803][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:46:13,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:46:13,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:46:14,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:46:14,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:46:14,539][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:46:15,358][__main__][INFO] - Iteration 567 took 23s (40.19% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 47m 52s. Estimated total time: 19h 27m 18s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 33s. [2025-11-13 11:46:15,362][__main__][INFO] - Starting iteration 567. [2025-11-13 11:46:15,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:46:15,365][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:46:24,880][__main__][INFO] - Number of regex retries in iteration 567: 0 [2025-11-13 11:46:24,881][__main__][INFO] - agents played in iteration 567 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:46:25,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:25,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:25,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:25,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:25,428][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:46:25,428][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:46:26,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:46:26,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:46:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:46:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:46:27,401][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:46:27,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:46:28,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:46:28,397][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:46:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:46:29,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:46:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:46:29,711][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:46:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:46:30,364][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:46:30,690][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:46:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:46:31,349][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:46:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:46:32,006][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:46:32,333][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:46:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:46:32,991][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:46:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:46:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:46:33,989][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:46:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:46:34,649][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:46:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:46:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:46:35,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:46:35,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:46:36,299][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:46:36,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:46:37,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:46:38,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:46:38,038][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:46:38,039][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:46:38,918][__main__][INFO] - Iteration 568 took 23s (40.39% Gen, 55.87% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 57m 53s. Estimated total time: 19h 37m 43s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 17s. [2025-11-13 11:46:38,920][__main__][INFO] - Starting iteration 568. [2025-11-13 11:46:38,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:46:38,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:46:48,879][__main__][INFO] - Number of regex retries in iteration 568: 0 [2025-11-13 11:46:48,880][__main__][INFO] - agents played in iteration 568 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:46:49,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:49,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:49,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:49,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:49,419][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:46:49,419][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:46:50,118][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:46:50,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:46:50,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:46:51,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:46:51,398][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:46:51,726][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:46:52,055][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:46:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:46:52,707][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:46:53,034][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:46:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:46:53,695][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:46:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:46:54,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:46:54,680][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:46:55,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:46:55,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:46:55,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:46:55,995][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:46:56,322][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:46:56,650][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:46:56,979][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:46:57,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:46:57,637][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:46:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:46:58,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:46:58,630][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:46:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:46:59,292][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:46:59,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:46:59,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:47:00,275][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:47:00,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:47:01,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:47:02,014][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:47:02,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:47:02,016][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:47:02,894][__main__][INFO] - Iteration 569 took 23s (41.53% Gen, 54.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 21s. Estimated total time: 19h 58m 35s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 45s. [2025-11-13 11:47:02,897][__main__][INFO] - Starting iteration 569. [2025-11-13 11:47:02,899][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:47:02,900][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:47:12,256][__main__][INFO] - Number of regex retries in iteration 569: 0 [2025-11-13 11:47:12,256][__main__][INFO] - agents played in iteration 569 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:47:12,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:12,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:12,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:12,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:12,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:47:12,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:47:13,484][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:47:13,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:47:14,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:47:14,441][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:47:14,768][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:47:15,095][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:47:15,423][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:47:15,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:47:16,078][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:47:16,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:47:16,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:47:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:47:17,385][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:47:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:47:18,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:47:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:47:18,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:47:19,025][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:47:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:47:19,678][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:47:20,005][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:47:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:47:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:47:20,994][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:47:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:47:21,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:47:21,986][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:47:22,313][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:47:22,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:47:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:47:23,297][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:47:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:47:23,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:47:24,672][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:47:25,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:47:25,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:47:25,371][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:47:26,212][__main__][INFO] - Iteration 570 took 23s (40.13% Gen, 56.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 3s. Estimated total time: 19h 25m 40s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 16s. [2025-11-13 11:47:26,214][__main__][INFO] - Starting iteration 570. [2025-11-13 11:47:26,218][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. [2025-11-13 11:47:26,219][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:47:35,322][__main__][INFO] - Number of regex retries in iteration 570: 0 [2025-11-13 11:47:35,322][__main__][INFO] - agents played in iteration 570 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:47:35,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:35,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:35,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:35,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:35,867][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:47:35,867][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:47:36,577][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:47:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:47:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:47:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:47:37,857][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:47:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:47:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:47:38,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:47:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:47:39,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:47:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:47:40,154][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:47:40,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:47:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:47:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:47:41,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:47:41,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:47:42,120][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:47:42,447][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:47:42,774][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:47:43,102][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:47:43,430][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:47:43,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:47:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:47:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:47:44,745][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:47:45,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:47:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:47:45,728][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:47:46,057][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:47:46,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:47:46,712][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:47:47,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:47:47,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:47:48,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:47:48,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:47:48,443][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:47:50,205][__main__][INFO] - Iteration 571 took 23s (37.95% Gen, 54.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 22s. Estimated total time: 19h 59m 23s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 53s. [2025-11-13 11:47:50,207][__main__][INFO] - Starting iteration 571. [2025-11-13 11:47:50,209][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:47:50,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:47:59,804][__main__][INFO] - Number of regex retries in iteration 571: 0 [2025-11-13 11:47:59,805][__main__][INFO] - agents played in iteration 571 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:48:00,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:00,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:00,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:00,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:00,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:48:00,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:48:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:48:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:48:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:48:02,008][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:48:02,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:48:02,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:48:02,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:48:03,322][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:48:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:48:03,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:48:04,311][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:48:04,640][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:48:04,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:48:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:48:05,632][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:48:05,967][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:48:06,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:48:06,620][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:48:06,952][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:48:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:48:07,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:48:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:48:08,273][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:48:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:48:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:48:09,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:48:09,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:48:09,913][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:48:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:48:10,569][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:48:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:48:11,224][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:48:11,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:48:12,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:48:12,957][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:48:12,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:48:12,959][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:48:13,795][__main__][INFO] - Iteration 572 took 23s (40.68% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 57m 55s. Estimated total time: 19h 39m 20s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 33s. [2025-11-13 11:48:13,797][__main__][INFO] - Starting iteration 572. [2025-11-13 11:48:13,800][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:48:13,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:48:21,794][__main__][INFO] - Number of regex retries in iteration 572: 0 [2025-11-13 11:48:21,795][__main__][INFO] - agents played in iteration 572 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:48:22,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:22,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:22,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:22,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:22,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:48:22,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:48:23,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:48:23,329][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:48:23,663][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:48:23,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:48:24,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:48:24,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:48:24,982][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:48:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:48:25,636][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:48:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:48:26,289][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:48:26,615][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:48:26,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:48:27,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:48:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:48:27,928][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:48:28,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:48:28,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:48:28,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:48:29,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:48:29,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:48:29,896][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:48:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:48:30,555][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:48:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:48:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:48:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:48:31,879][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:48:32,208][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:48:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:48:32,869][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:48:33,203][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:48:33,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:48:34,253][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:48:34,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:48:34,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:48:34,948][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:48:35,862][__main__][INFO] - Iteration 573 took 22s (36.23% Gen, 59.62% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 41m 22s. Estimated total time: 18h 23m 9s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 46s, 500 more iterations: 3h 3m 51s. [2025-11-13 11:48:35,864][__main__][INFO] - Starting iteration 573. [2025-11-13 11:48:35,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:48:35,867][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:48:45,417][__main__][INFO] - Number of regex retries in iteration 573: 0 [2025-11-13 11:48:45,418][__main__][INFO] - agents played in iteration 573 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:48:45,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:45,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:45,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:45,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:48:45,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:48:45,960][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:48:46,655][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:48:46,953][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:48:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:48:47,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:48:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:48:48,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:48:48,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:48:48,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:48:49,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:48:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:48:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:48:50,248][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:48:50,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:48:50,901][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:48:51,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:48:51,559][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:48:51,885][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:48:52,212][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:48:52,538][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:48:52,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:48:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:48:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:48:53,846][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:48:54,173][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:48:54,500][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:48:54,826][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:48:55,152][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:48:55,480][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:48:55,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:48:56,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:48:56,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:48:56,809][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:48:57,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:48:57,859][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:48:58,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:48:58,556][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:48:58,557][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:48:59,410][__main__][INFO] - Iteration 574 took 23s (40.56% Gen, 55.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 55m 2s. Estimated total time: 19h 37m 12s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 12s. [2025-11-13 11:48:59,412][__main__][INFO] - Starting iteration 574. [2025-11-13 11:48:59,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:48:59,416][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:49:09,102][__main__][INFO] - Number of regex retries in iteration 574: 0 [2025-11-13 11:49:09,103][__main__][INFO] - agents played in iteration 574 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:49:09,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:09,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:09,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:09,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:09,656][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:49:09,657][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:49:10,365][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:49:10,662][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:49:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:49:11,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:49:11,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:49:11,982][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:49:12,316][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:49:12,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:49:12,972][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:49:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:49:13,626][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:49:13,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:49:14,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:49:14,610][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:49:14,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:49:15,264][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:49:15,590][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:49:15,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:49:16,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:49:16,568][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:49:16,894][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:49:17,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:49:17,548][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:49:17,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:49:18,202][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:49:18,528][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:49:18,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:49:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:49:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:49:19,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:49:20,165][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:49:20,492][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:49:20,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:49:21,530][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:49:22,242][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:49:22,243][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:49:22,244][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:49:23,076][__main__][INFO] - Iteration 575 took 23s (40.94% Gen, 55.54% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 33s. Estimated total time: 19h 43m 7s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 11s. [2025-11-13 11:49:23,078][__main__][INFO] - Starting iteration 575. [2025-11-13 11:49:23,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:49:23,082][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:49:33,133][__main__][INFO] - Number of regex retries in iteration 575: 0 [2025-11-13 11:49:33,133][__main__][INFO] - agents played in iteration 575 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:49:33,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:33,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:33,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:33,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:33,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:49:33,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:49:34,384][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:49:34,681][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:49:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:49:35,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:49:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:49:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:49:36,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:49:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:49:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:49:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:49:37,638][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:49:37,964][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:49:38,290][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:49:38,617][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:49:38,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:49:39,268][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:49:39,594][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:49:39,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:49:40,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:49:40,574][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:49:40,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:49:41,228][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:49:41,554][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:49:41,880][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:49:42,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:49:42,533][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:49:42,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:49:43,187][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:49:43,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:49:43,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:49:44,169][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:49:44,499][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:49:44,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:49:45,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:49:46,252][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:49:46,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:49:46,255][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:49:47,099][__main__][INFO] - Iteration 576 took 24s (41.85% Gen, 54.63% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 17m 57s. Estimated total time: 20h 0m 56s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 9s. [2025-11-13 11:49:47,101][__main__][INFO] - Starting iteration 576. [2025-11-13 11:49:47,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:49:47,105][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:49:56,980][__main__][INFO] - Number of regex retries in iteration 576: 0 [2025-11-13 11:49:56,980][__main__][INFO] - agents played in iteration 576 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:49:57,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:57,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:57,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:57,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:49:57,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:49:57,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:49:58,219][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:49:58,516][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:49:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:49:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:49:59,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:49:59,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:50:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:50:00,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:50:00,827][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:50:01,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:50:01,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:50:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:50:02,141][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:50:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:50:02,797][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:50:03,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:50:03,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:50:03,784][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:50:04,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:50:04,440][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:50:04,767][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:50:05,096][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:50:05,429][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:50:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:50:06,082][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:50:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:50:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:50:07,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:50:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:50:07,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:50:08,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:50:08,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:50:08,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:50:09,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:50:10,140][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:50:10,142][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:50:10,144][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:50:10,980][__main__][INFO] - Iteration 577 took 23s (41.36% Gen, 55.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 10m 31s. Estimated total time: 19h 53m 52s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 58s. [2025-11-13 11:50:10,983][__main__][INFO] - Starting iteration 577. [2025-11-13 11:50:10,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:50:10,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:50:20,373][__main__][INFO] - Number of regex retries in iteration 577: 0 [2025-11-13 11:50:20,374][__main__][INFO] - agents played in iteration 577 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:50:20,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:20,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:20,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:20,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:20,924][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:50:20,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:50:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:50:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:50:22,277][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:50:22,611][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:50:22,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:50:23,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:50:23,595][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:50:23,923][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:50:24,249][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:50:24,575][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:50:24,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:50:25,228][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:50:25,554][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:50:25,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:50:26,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:50:26,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:50:26,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:50:27,187][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:50:27,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:50:27,840][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:50:28,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:50:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:50:28,823][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:50:29,149][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:50:29,478][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:50:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:50:30,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:50:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:50:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:50:31,114][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:50:31,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:50:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:50:32,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:50:32,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:50:33,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:50:33,508][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:50:33,509][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:50:34,335][__main__][INFO] - Iteration 578 took 23s (40.20% Gen, 56.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 43m 45s. Estimated total time: 19h 27m 30s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 35s. [2025-11-13 11:50:34,337][__main__][INFO] - Starting iteration 578. [2025-11-13 11:50:34,340][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:50:34,340][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:50:43,659][__main__][INFO] - Number of regex retries in iteration 578: 0 [2025-11-13 11:50:43,659][__main__][INFO] - agents played in iteration 578 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:50:44,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:44,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:44,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:44,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:44,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:50:44,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:50:44,921][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:50:45,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:50:45,547][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:50:45,876][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:50:46,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:50:46,536][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:50:46,864][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:50:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:50:47,527][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:50:47,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:50:48,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:50:48,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:50:48,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:50:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:50:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:50:49,829][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:50:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:50:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:50:50,811][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:50:51,138][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:50:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:50:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:50:52,121][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:50:52,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:50:52,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:50:53,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:50:53,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:50:53,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:50:54,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:50:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:50:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:50:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:50:55,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:50:56,098][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:50:56,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:50:56,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:50:56,800][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:50:57,615][__main__][INFO] - Iteration 579 took 23s (40.03% Gen, 56.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 39m 40s. Estimated total time: 19h 23m 48s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 58s. [2025-11-13 11:50:57,617][__main__][INFO] - Starting iteration 579. [2025-11-13 11:50:57,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:50:57,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:51:06,983][__main__][INFO] - Number of regex retries in iteration 579: 0 [2025-11-13 11:51:06,984][__main__][INFO] - agents played in iteration 579 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:51:07,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:07,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:07,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:07,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:07,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:51:07,539][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:51:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:51:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:51:08,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:51:09,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:51:09,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:51:09,890][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:51:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:51:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:51:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:51:11,199][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:51:11,525][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:51:11,852][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:51:12,180][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:51:12,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:51:12,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:51:13,161][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:51:13,486][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:51:13,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:51:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:51:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:51:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:51:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:51:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:51:15,796][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:51:16,125][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:51:16,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:51:16,785][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:51:17,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:51:17,442][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:51:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:51:18,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:51:18,423][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:51:18,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:51:19,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:51:20,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:51:20,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:51:20,176][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:51:21,010][__main__][INFO] - Iteration 580 took 23s (40.03% Gen, 56.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 1s. Estimated total time: 19h 29m 33s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 55s. [2025-11-13 11:51:21,011][__main__][INFO] - Starting iteration 580. [2025-11-13 11:51:21,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. [2025-11-13 11:51:21,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:51:30,956][__main__][INFO] - Number of regex retries in iteration 580: 0 [2025-11-13 11:51:30,957][__main__][INFO] - agents played in iteration 580 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:51:31,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:31,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:31,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:31,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:31,521][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:51:31,521][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:51:32,241][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:51:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:51:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:51:33,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:51:33,532][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:51:33,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:51:34,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:51:34,512][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:51:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:51:35,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:51:35,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:51:35,821][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:51:36,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:51:36,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:51:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:51:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:51:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:51:37,789][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:51:38,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:51:38,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:51:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:51:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:51:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:51:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:51:40,075][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:51:40,400][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:51:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:51:41,053][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:51:41,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:51:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:51:42,033][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:51:42,360][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:51:42,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:51:43,398][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:51:44,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:51:44,124][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:51:44,126][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:51:45,849][__main__][INFO] - Iteration 581 took 24s (40.03% Gen, 53.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 56m 49s. Estimated total time: 20h 41m 45s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 23s, 500 more iterations: 3h 26m 57s. [2025-11-13 11:51:45,851][__main__][INFO] - Starting iteration 581. [2025-11-13 11:51:45,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:51:45,854][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:51:56,239][__main__][INFO] - Number of regex retries in iteration 581: 0 [2025-11-13 11:51:56,240][__main__][INFO] - agents played in iteration 581 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:51:56,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:56,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:56,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:56,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:51:56,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:51:56,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:51:57,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:51:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:51:58,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:51:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:51:58,788][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:51:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:51:59,444][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:51:59,771][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:52:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:52:00,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:52:00,748][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:52:01,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:52:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:52:01,726][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:52:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:52:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:52:02,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:52:03,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:52:03,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:52:03,685][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:52:04,011][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:52:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:52:04,663][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:52:04,989][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:52:05,316][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:52:05,642][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:52:05,969][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:52:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:52:06,623][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:52:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:52:07,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:52:07,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:52:07,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:52:08,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:52:09,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:52:09,375][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:52:09,377][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:52:10,238][__main__][INFO] - Iteration 582 took 24s (42.59% Gen, 53.87% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 33m 54s. Estimated total time: 20h 19m 15s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 38s, 500 more iterations: 3h 23m 12s. [2025-11-13 11:52:10,240][__main__][INFO] - Starting iteration 582. [2025-11-13 11:52:10,243][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:52:10,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:52:19,571][__main__][INFO] - Number of regex retries in iteration 582: 0 [2025-11-13 11:52:19,571][__main__][INFO] - agents played in iteration 582 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:52:20,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:20,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:20,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:20,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:20,136][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:52:20,137][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:52:20,865][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:52:21,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:52:21,496][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:52:21,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:52:22,152][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:52:22,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:52:22,808][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:52:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:52:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:52:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:52:24,120][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:52:24,450][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:52:24,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:52:25,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:52:25,441][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:52:25,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:52:26,100][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:52:26,429][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:52:26,756][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:52:27,085][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:52:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:52:27,744][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:52:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:52:28,403][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:52:28,734][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:52:29,061][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:52:29,388][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:52:29,717][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:52:30,044][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:52:30,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:52:30,708][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:52:31,037][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:52:31,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:52:32,090][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:52:32,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:52:32,823][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:52:32,825][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:52:33,702][__main__][INFO] - Iteration 583 took 23s (39.76% Gen, 56.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 47m 15s. Estimated total time: 19h 33m 0s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 30s. [2025-11-13 11:52:33,704][__main__][INFO] - Starting iteration 583. [2025-11-13 11:52:33,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:52:33,708][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:52:43,247][__main__][INFO] - Number of regex retries in iteration 583: 0 [2025-11-13 11:52:43,248][__main__][INFO] - agents played in iteration 583 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:52:43,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:43,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:43,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:43,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:52:43,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:52:43,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:52:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:52:44,824][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:52:45,154][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:52:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:52:45,811][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:52:46,137][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:52:46,464][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:52:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:52:47,120][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:52:47,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:52:47,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:52:48,103][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:52:48,431][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:52:48,764][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:52:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:52:49,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:52:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:52:50,080][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:52:50,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:52:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:52:51,062][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:52:51,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:52:51,717][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:52:52,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:52:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:52:52,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:52:53,025][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:52:53,353][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:52:53,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:52:54,007][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:52:54,333][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:52:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:52:54,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:52:55,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:52:56,380][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:52:56,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:52:56,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:52:57,287][__main__][INFO] - Iteration 584 took 23s (40.46% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 52m 54s. Estimated total time: 19h 39m 2s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 30s. [2025-11-13 11:52:57,289][__main__][INFO] - Starting iteration 584. [2025-11-13 11:52:57,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:52:57,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:53:06,926][__main__][INFO] - Number of regex retries in iteration 584: 0 [2025-11-13 11:53:06,927][__main__][INFO] - agents played in iteration 584 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:53:07,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:07,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:07,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:07,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:07,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:53:07,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:53:08,209][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:53:08,508][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:53:08,837][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:53:09,167][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:53:09,496][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:53:09,826][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:53:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:53:10,490][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:53:10,819][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:53:11,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:53:11,484][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:53:11,811][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:53:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:53:12,472][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:53:12,801][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:53:13,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:53:13,455][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:53:13,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:53:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:53:14,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:53:14,763][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:53:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:53:15,416][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:53:15,744][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:53:16,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:53:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:53:16,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:53:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:53:17,385][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:53:17,711][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:53:18,038][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:53:18,367][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:53:18,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:53:19,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:53:20,101][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:53:20,102][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:53:20,104][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:53:21,072][__main__][INFO] - Iteration 585 took 23s (40.51% Gen, 55.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 2m 27s. Estimated total time: 19h 48m 59s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 9s. [2025-11-13 11:53:21,074][__main__][INFO] - Starting iteration 585. [2025-11-13 11:53:21,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:53:21,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:53:30,561][__main__][INFO] - Number of regex retries in iteration 585: 0 [2025-11-13 11:53:30,562][__main__][INFO] - agents played in iteration 585 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:53:30,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:31,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:31,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:31,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:31,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:53:31,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:53:31,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:53:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:53:32,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:53:32,805][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:53:33,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:53:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:53:33,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:53:34,122][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:53:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:53:34,789][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:53:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:53:35,448][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:53:35,782][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:53:36,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:53:36,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:53:36,772][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:53:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:53:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:53:37,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:53:38,085][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:53:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:53:38,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:53:39,072][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:53:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:53:39,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:53:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:53:40,378][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:53:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:53:41,031][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:53:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:53:41,684][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:53:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:53:42,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:53:43,009][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:53:43,744][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:53:43,745][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:53:43,747][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:53:44,659][__main__][INFO] - Iteration 586 took 23s (40.21% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 52m 11s. Estimated total time: 19h 39m 7s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 31s. [2025-11-13 11:53:44,661][__main__][INFO] - Starting iteration 586. [2025-11-13 11:53:44,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:53:44,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:53:54,202][__main__][INFO] - Number of regex retries in iteration 586: 0 [2025-11-13 11:53:54,203][__main__][INFO] - agents played in iteration 586 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:53:54,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:54,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:54,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:54,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:54,761][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:53:54,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:53:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:53:55,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:53:56,121][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:53:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:53:56,776][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:53:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:53:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:53:57,760][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:53:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:53:58,414][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:53:58,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:53:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:53:59,399][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:53:59,727][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:54:00,059][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:54:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:54:00,714][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:54:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:54:01,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:54:01,696][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:54:02,024][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:54:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:54:02,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:54:03,012][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:54:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:54:03,676][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:54:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:54:04,339][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:54:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:54:04,998][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:54:05,331][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:54:05,660][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:54:05,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:54:06,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:07,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:07,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:07,388][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:08,282][__main__][INFO] - Iteration 587 took 23s (40.38% Gen, 55.83% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 53m 35s. Estimated total time: 19h 40m 55s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s. [2025-11-13 11:54:08,284][__main__][INFO] - Starting iteration 587. [2025-11-13 11:54:08,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:54:08,289][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:17,353][__main__][INFO] - Number of regex retries in iteration 587: 0 [2025-11-13 11:54:17,354][__main__][INFO] - agents played in iteration 587 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:54:17,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:17,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:17,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:17,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:17,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:17,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:54:18,652][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:54:18,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:54:19,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:54:19,606][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:54:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:54:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:54:20,589][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:54:20,922][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:54:21,253][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:54:21,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:54:21,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:54:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:54:22,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:54:22,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:54:23,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:54:23,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:54:23,886][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:54:24,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:54:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:54:24,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:54:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:54:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:54:25,847][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:54:26,180][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:54:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:54:26,840][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:54:27,169][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:54:27,495][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:54:27,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:54:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:54:28,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:54:28,803][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:54:29,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:54:29,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:30,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:30,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:30,543][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:31,444][__main__][INFO] - Iteration 588 took 23s (39.15% Gen, 56.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 30m 9s. Estimated total time: 19h 17m 51s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 58s. [2025-11-13 11:54:31,447][__main__][INFO] - Starting iteration 588. [2025-11-13 11:54:31,450][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:54:31,451][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:39,869][__main__][INFO] - Number of regex retries in iteration 588: 0 [2025-11-13 11:54:39,870][__main__][INFO] - agents played in iteration 588 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:54:40,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:40,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:40,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:40,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:40,425][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:40,426][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:54:41,179][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:54:41,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:54:41,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:54:42,134][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:54:42,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:54:42,790][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:54:43,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:54:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:54:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:54:44,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:54:44,438][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:54:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:54:45,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:54:45,420][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:54:45,750][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:54:46,077][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:54:46,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:54:46,742][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:54:47,071][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:54:47,399][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:54:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:54:48,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:54:48,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:54:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:54:49,043][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:54:49,373][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:54:49,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:54:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:54:50,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:54:50,691][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:54:51,018][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:54:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:54:51,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:54:52,354][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:53,109][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:53,110][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:53,112][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:54,045][__main__][INFO] - Iteration 589 took 22s (37.26% Gen, 58.60% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 1m 41s. Estimated total time: 18h 49m 46s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 39s, 500 more iterations: 3h 8m 17s. [2025-11-13 11:54:54,047][__main__][INFO] - Starting iteration 589. [2025-11-13 11:54:54,051][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:54:54,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:55:01,994][__main__][INFO] - Number of regex retries in iteration 589: 0 [2025-11-13 11:55:01,994][__main__][INFO] - agents played in iteration 589 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:55:02,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:02,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:02,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:02,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:02,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:55:02,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:55:03,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:55:03,616][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:55:03,945][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:55:04,273][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:55:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:55:04,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:55:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:55:05,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:55:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:55:06,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:55:06,573][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:55:06,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:55:07,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:55:07,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:55:07,887][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:55:08,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:55:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:55:08,873][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:55:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:55:09,530][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:55:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:55:10,185][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:55:10,513][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:55:10,846][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:55:11,173][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:55:11,501][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:55:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:55:12,157][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:55:12,484][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:55:12,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:55:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:55:13,468][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:55:13,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:55:14,471][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:55:15,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:55:15,206][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:55:15,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:55:16,085][__main__][INFO] - Iteration 590 took 22s (36.05% Gen, 59.97% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 33m 18s. Estimated total time: 18h 21m 45s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 43s, 500 more iterations: 3h 3m 37s. [2025-11-13 11:55:16,087][__main__][INFO] - Starting iteration 590. [2025-11-13 11:55:16,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. [2025-11-13 11:55:16,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:55:24,943][__main__][INFO] - Number of regex retries in iteration 590: 0 [2025-11-13 11:55:24,943][__main__][INFO] - agents played in iteration 590 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:55:25,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:25,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:25,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:25,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:25,498][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:55:25,499][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:55:26,233][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:55:26,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:55:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:55:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:55:27,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:55:27,842][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:55:28,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:55:28,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:55:28,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:55:29,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:55:29,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:55:29,807][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:55:30,135][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:55:30,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:55:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:55:31,122][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:55:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:55:31,784][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:55:32,111][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:55:32,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:55:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:55:33,092][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:55:33,420][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:55:33,746][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:55:34,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:55:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:55:34,740][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:55:35,067][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:55:35,394][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:55:35,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:55:36,048][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:55:36,377][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:55:36,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:55:37,382][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:55:38,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:55:38,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:55:38,119][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:55:39,942][__main__][INFO] - Iteration 591 took 23s (37.11% Gen, 55.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 3m 46s. Estimated total time: 19h 52m 37s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 45s, 500 more iterations: 3h 18m 46s. [2025-11-13 11:55:39,944][__main__][INFO] - Starting iteration 591. [2025-11-13 11:55:39,948][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:55:39,948][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:55:48,403][__main__][INFO] - Number of regex retries in iteration 591: 0 [2025-11-13 11:55:48,404][__main__][INFO] - agents played in iteration 591 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:55:48,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:48,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:48,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:48,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:48,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:55:48,962][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:55:49,710][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:55:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:55:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:55:50,664][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:55:50,991][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:55:51,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:55:51,644][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:55:51,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:55:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:55:52,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:55:52,957][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:55:53,285][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:55:53,612][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:55:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:55:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:55:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:55:54,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:55:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:55:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:55:55,923][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:55:56,252][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:55:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:55:56,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:55:57,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:55:57,571][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:55:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:55:58,228][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:55:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:55:58,883][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:55:59,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:55:59,540][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:55:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:56:00,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:56:00,880][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:56:01,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:56:01,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:56:01,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:56:02,575][__main__][INFO] - Iteration 592 took 22s (37.37% Gen, 58.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 2m 11s. Estimated total time: 18h 51m 24s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 42s, 500 more iterations: 3h 8m 34s. [2025-11-13 11:56:02,577][__main__][INFO] - Starting iteration 592. [2025-11-13 11:56:02,581][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:56:02,582][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:56:11,359][__main__][INFO] - Number of regex retries in iteration 592: 0 [2025-11-13 11:56:11,359][__main__][INFO] - agents played in iteration 592 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:56:11,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:11,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:11,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:11,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:11,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:56:11,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:56:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:56:12,970][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:56:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:56:13,625][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:56:13,953][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:56:14,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:56:14,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:56:14,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:56:15,262][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:56:15,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:56:15,917][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:56:16,244][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:56:16,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:56:16,899][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:56:17,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:56:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:56:17,882][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:56:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:56:18,546][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:56:18,873][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:56:19,202][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:56:19,533][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:56:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:56:20,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:56:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:56:20,849][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:56:21,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:56:21,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:56:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:56:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:56:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:56:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:56:23,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:56:23,847][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:56:24,587][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:56:24,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:56:24,590][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:56:25,511][__main__][INFO] - Iteration 593 took 22s (38.27% Gen, 57.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 57s. Estimated total time: 19h 6m 33s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 5s. [2025-11-13 11:56:25,514][__main__][INFO] - Starting iteration 593. [2025-11-13 11:56:25,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:56:25,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:56:34,557][__main__][INFO] - Number of regex retries in iteration 593: 0 [2025-11-13 11:56:34,558][__main__][INFO] - agents played in iteration 593 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:56:34,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:35,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:35,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:35,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:35,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:56:35,124][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:56:35,863][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:56:36,162][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:56:36,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:56:36,817][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:56:37,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:56:37,472][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:56:37,799][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:56:38,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:56:38,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:56:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:56:39,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:56:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:56:39,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:56:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:56:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:56:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:56:41,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:56:41,405][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:56:41,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:56:42,062][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:56:42,389][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:56:42,724][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:56:43,058][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:56:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:56:43,722][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:56:44,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:56:44,389][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:56:44,719][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:56:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:56:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:56:45,707][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:56:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:56:46,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:56:47,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:56:47,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:56:47,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:56:47,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:56:48,735][__main__][INFO] - Iteration 594 took 23s (38.93% Gen, 57.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 30m 57s. Estimated total time: 19h 20m 57s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 29s. [2025-11-13 11:56:48,737][__main__][INFO] - Starting iteration 594. [2025-11-13 11:56:48,741][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:56:48,741][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:56:57,450][__main__][INFO] - Number of regex retries in iteration 594: 0 [2025-11-13 11:56:57,450][__main__][INFO] - agents played in iteration 594 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:56:57,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:57,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:57,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:58,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:56:58,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:56:58,009][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:56:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:56:59,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:56:59,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:56:59,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:57:00,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:57:00,366][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:57:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:57:01,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:57:01,348][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:57:01,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:57:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:57:02,330][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:57:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:57:02,986][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:57:03,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:57:03,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:57:03,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:57:04,300][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:57:04,628][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:57:04,957][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:57:05,284][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:57:05,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:57:05,938][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:57:06,268][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:57:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:57:06,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:57:07,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:57:07,584][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:57:07,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:57:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:57:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:57:08,895][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:57:09,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:57:09,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:57:10,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:57:10,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:57:10,675][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:57:11,648][__main__][INFO] - Iteration 595 took 22s (38.02% Gen, 57.73% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 15m 2s. Estimated total time: 19h 5m 25s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 54s. [2025-11-13 11:57:11,650][__main__][INFO] - Starting iteration 595. [2025-11-13 11:57:11,654][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:57:11,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:57:20,266][__main__][INFO] - Number of regex retries in iteration 595: 0 [2025-11-13 11:57:20,266][__main__][INFO] - agents played in iteration 595 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:57:20,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:20,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:20,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:20,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:20,826][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:57:20,827][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:57:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:57:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:57:22,202][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:57:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:57:22,858][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:57:23,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:57:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:57:23,841][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:57:24,169][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:57:24,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:57:24,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:57:25,155][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:57:25,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:57:25,810][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:57:26,139][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:57:26,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:57:26,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:57:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:57:27,450][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:57:27,777][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:57:28,105][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:57:28,433][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:57:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:57:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:57:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:57:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:57:30,074][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:57:30,402][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:57:30,728][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:57:31,057][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:57:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:57:31,710][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:57:32,037][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:57:32,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:57:33,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:57:33,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:57:33,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:57:34,406][__main__][INFO] - Iteration 596 took 22s (37.85% Gen, 58.11% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 6m 53s. Estimated total time: 18h 57m 38s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 55s, 500 more iterations: 3h 9m 36s. [2025-11-13 11:57:34,407][__main__][INFO] - Starting iteration 596. [2025-11-13 11:57:34,411][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:57:34,411][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:57:42,984][__main__][INFO] - Number of regex retries in iteration 596: 0 [2025-11-13 11:57:42,985][__main__][INFO] - agents played in iteration 596 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:57:43,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:43,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:43,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:43,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:57:43,547][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:57:43,547][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:57:44,281][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:57:44,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:57:44,907][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:57:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:57:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:57:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:57:46,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:57:46,543][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:57:46,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:57:47,198][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:57:47,525][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:57:47,852][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:57:48,180][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:57:48,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:57:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:57:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:57:49,490][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:57:49,818][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:57:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:57:50,470][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:57:50,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:57:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:57:51,452][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:57:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:57:52,107][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:57:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:57:52,768][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:57:53,102][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:57:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:57:53,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:57:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:57:54,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:57:54,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:57:55,465][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:57:56,213][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:57:56,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:57:56,216][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:57:57,145][__main__][INFO] - Iteration 597 took 22s (37.71% Gen, 58.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 5m 37s. Estimated total time: 18h 56m 45s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 27s. [2025-11-13 11:57:57,147][__main__][INFO] - Starting iteration 597. [2025-11-13 11:57:57,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:57:57,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:58:05,653][__main__][INFO] - Number of regex retries in iteration 597: 0 [2025-11-13 11:58:05,654][__main__][INFO] - agents played in iteration 597 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:58:06,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:06,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:06,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:06,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:06,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:58:06,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:58:06,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:58:07,262][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:58:07,589][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:58:07,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:58:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:58:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:58:08,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:58:09,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:58:09,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:58:09,881][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:58:10,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:58:10,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:58:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:58:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:58:11,519][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:58:11,847][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:58:12,174][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:58:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:58:12,827][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:58:13,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:58:13,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:58:13,806][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:58:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:58:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:58:14,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:58:15,115][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:58:15,442][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:58:15,769][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:58:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:58:16,428][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:58:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:58:17,090][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:58:17,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:58:18,161][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:58:18,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:58:18,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:58:18,925][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:58:19,850][__main__][INFO] - Iteration 598 took 22s (37.46% Gen, 58.46% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 29s. Estimated total time: 18h 55m 0s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 10s. [2025-11-13 11:58:19,852][__main__][INFO] - Starting iteration 598. [2025-11-13 11:58:19,856][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:58:19,857][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:58:28,898][__main__][INFO] - Number of regex retries in iteration 598: 0 [2025-11-13 11:58:28,899][__main__][INFO] - agents played in iteration 598 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:58:29,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:29,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:29,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:29,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:29,450][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:58:29,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:58:30,198][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:58:30,498][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:58:30,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:58:31,152][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:58:31,480][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:58:31,813][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:58:32,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:58:32,471][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:58:32,798][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:58:33,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:58:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:58:33,784][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:58:34,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:58:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:58:34,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:58:35,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:58:35,426][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:58:35,753][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:58:36,080][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:58:36,407][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:58:36,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:58:37,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:58:37,390][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:58:37,717][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:58:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:58:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:58:38,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:58:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:58:39,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:58:39,685][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:58:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:58:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:58:40,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:58:41,384][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:58:42,140][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:58:42,141][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:58:42,143][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:58:43,104][__main__][INFO] - Iteration 599 took 23s (38.89% Gen, 56.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 30m 33s. Estimated total time: 19h 22m 27s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 44s. [2025-11-13 11:58:43,106][__main__][INFO] - Starting iteration 599. [2025-11-13 11:58:43,109][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:58:43,110][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:58:51,789][__main__][INFO] - Number of regex retries in iteration 599: 0 [2025-11-13 11:58:51,789][__main__][INFO] - agents played in iteration 599 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:58:52,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:52,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:52,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:52,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:58:52,341][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:58:52,342][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:58:53,069][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:58:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:58:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:58:54,025][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:58:54,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:58:54,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:58:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:58:55,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:58:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:58:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:58:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:58:56,647][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:58:56,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:58:57,305][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:58:57,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:58:57,960][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:58:58,289][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:58:58,616][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:58:58,943][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:58:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:58:59,598][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:58:59,925][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:59:00,252][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:59:00,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:59:00,908][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:59:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:59:01,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:59:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:59:02,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:59:02,545][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:59:02,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:59:03,198][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:59:03,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:59:04,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:59:04,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:59:04,942][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:59:04,943][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:59:06,082][__main__][INFO] - Iteration 600 took 22s (37.78% Gen, 57.26% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 25s. Estimated total time: 19h 8m 42s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 27s. [2025-11-13 11:59:06,085][__main__][INFO] - Starting iteration 600. [2025-11-13 11:59:06,088][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. [2025-11-13 11:59:06,088][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:59:14,497][__main__][INFO] - Number of regex retries in iteration 600: 0 [2025-11-13 11:59:14,497][__main__][INFO] - agents played in iteration 600 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:59:14,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:14,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:15,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:15,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:15,046][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:59:15,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:59:15,767][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:59:16,067][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:59:16,395][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:59:16,723][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:59:17,051][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:59:17,380][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:59:17,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:59:18,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:59:18,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:59:18,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:59:19,020][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:59:19,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:59:19,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:59:20,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:59:20,331][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:59:20,659][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:59:20,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:59:21,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:59:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:59:21,970][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:59:22,298][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:59:22,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:59:22,953][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:59:23,283][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:59:23,610][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:59:23,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:59:24,265][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:59:24,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:59:24,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:59:25,250][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:59:25,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:59:25,904][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:59:26,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:59:26,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:59:27,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:59:27,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:59:27,673][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:59:30,533][__main__][INFO] - Iteration 601 took 24s (34.40% Gen, 53.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 29m 38s. Estimated total time: 20h 22m 19s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 44s, 500 more iterations: 3h 23m 43s. [2025-11-13 11:59:30,535][__main__][INFO] - Starting iteration 601. [2025-11-13 11:59:30,539][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 11:59:30,540][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:59:39,286][__main__][INFO] - Number of regex retries in iteration 601: 0 [2025-11-13 11:59:39,287][__main__][INFO] - agents played in iteration 601 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 11:59:39,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:39,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:39,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:39,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:59:39,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:59:39,839][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 11:59:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:59:40,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:59:41,204][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:59:41,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:59:41,860][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:59:42,188][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:59:42,521][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:59:42,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:59:43,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:59:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:59:43,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:59:44,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:59:44,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:59:44,818][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:59:45,146][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:59:45,474][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:59:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:59:46,129][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:59:46,458][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:59:46,786][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:59:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:59:47,441][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:59:47,768][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:59:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:59:48,422][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:59:48,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:59:49,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:59:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:59:49,737][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:59:50,070][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:59:50,401][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:59:50,727][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:59:51,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 11:59:51,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:59:52,475][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:59:52,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:59:52,478][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:59:53,403][__main__][INFO] - Iteration 602 took 22s (38.26% Gen, 57.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 10m 10s. Estimated total time: 19h 3m 14s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 32s. [2025-11-13 11:59:53,405][__main__][INFO] - Starting iteration 602. [2025-11-13 11:59:53,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 11:59:53,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:00:02,316][__main__][INFO] - Number of regex retries in iteration 602: 0 [2025-11-13 12:00:02,317][__main__][INFO] - agents played in iteration 602 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:00:02,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:02,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:02,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:02,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:02,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:00:02,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:00:03,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:00:03,894][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:00:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:00:04,550][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:00:04,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:00:05,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:00:05,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:00:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:00:06,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:00:06,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:00:06,848][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:00:07,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:00:07,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:00:07,830][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:00:08,158][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:00:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:00:08,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:00:09,142][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:00:09,469][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:00:09,799][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:00:10,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:00:10,454][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:00:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:00:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:00:11,435][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:00:11,762][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:00:12,090][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:00:12,418][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:00:12,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:00:13,074][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:00:13,408][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:00:13,734][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:00:14,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:00:14,779][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:00:15,515][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:00:15,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:00:15,519][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:00:16,605][__main__][INFO] - Iteration 603 took 23s (38.40% Gen, 56.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 26m 26s. Estimated total time: 19h 19m 54s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 19s. [2025-11-13 12:00:16,607][__main__][INFO] - Starting iteration 603. [2025-11-13 12:00:16,611][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:00:16,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:00:25,396][__main__][INFO] - Number of regex retries in iteration 603: 0 [2025-11-13 12:00:25,396][__main__][INFO] - agents played in iteration 603 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:00:25,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:25,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:25,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:25,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:25,951][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:00:25,951][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:00:26,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:00:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:00:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:00:27,617][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:00:27,946][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:00:28,272][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:00:28,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:00:28,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:00:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:00:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:00:29,917][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:00:30,248][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:00:30,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:00:30,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:00:31,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:00:31,566][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:00:31,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:00:32,222][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:00:32,549][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:00:32,876][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:00:33,205][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:00:33,532][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:00:33,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:00:34,186][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:00:34,514][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:00:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:00:35,171][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:00:35,500][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:00:35,832][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:00:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:00:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:00:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:00:37,148][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:00:37,845][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:00:38,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:00:38,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:00:38,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:00:39,498][__main__][INFO] - Iteration 604 took 22s (38.38% Gen, 57.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 10m 32s. Estimated total time: 19h 4m 23s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 43s. [2025-11-13 12:00:39,500][__main__][INFO] - Starting iteration 604. [2025-11-13 12:00:39,503][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:00:39,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:00:48,537][__main__][INFO] - Number of regex retries in iteration 604: 0 [2025-11-13 12:00:48,538][__main__][INFO] - agents played in iteration 604 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:00:48,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:49,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:49,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:49,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:49,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:00:49,090][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:00:49,823][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:00:50,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:00:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:00:50,781][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:00:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:00:51,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:00:51,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:00:52,107][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:00:52,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:00:52,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:00:53,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:00:53,427][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:00:53,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:00:54,086][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:00:54,416][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:00:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:00:55,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:00:55,403][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:00:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:00:56,056][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:00:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:00:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:00:57,038][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:00:57,366][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:00:57,697][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:00:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:00:58,351][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:00:58,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:00:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:00:59,333][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:00:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:00:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:01:00,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:01:01,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:01,773][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:01,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:01,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:02,743][__main__][INFO] - Iteration 605 took 23s (38.87% Gen, 57.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 50s. Estimated total time: 19h 22m 4s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 40s. [2025-11-13 12:01:02,746][__main__][INFO] - Starting iteration 605. [2025-11-13 12:01:02,749][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:01:02,750][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:01:11,068][__main__][INFO] - Number of regex retries in iteration 605: 0 [2025-11-13 12:01:11,069][__main__][INFO] - agents played in iteration 605 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:01:11,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:11,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:11,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:11,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:11,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:01:11,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:01:12,349][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:01:12,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:01:12,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:01:13,301][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:01:13,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:01:13,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:01:14,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:01:14,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:01:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:01:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:01:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:01:15,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:01:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:01:16,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:01:16,915][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:01:17,244][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:01:17,577][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:01:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:01:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:01:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:01:18,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:01:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:01:19,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:01:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:01:20,212][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:01:20,539][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:01:20,868][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:01:21,198][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:01:21,526][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:01:21,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:01:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:01:22,509][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:01:22,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:01:23,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:24,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:24,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:24,261][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:25,203][__main__][INFO] - Iteration 606 took 22s (37.04% Gen, 58.75% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 48m 8s. Estimated total time: 18h 42m 44s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 25s, 500 more iterations: 3h 7m 7s. [2025-11-13 12:01:25,207][__main__][INFO] - Starting iteration 606. [2025-11-13 12:01:25,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:01:25,211][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:01:33,859][__main__][INFO] - Number of regex retries in iteration 606: 0 [2025-11-13 12:01:33,859][__main__][INFO] - agents played in iteration 606 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:01:34,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:34,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:34,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:34,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:34,425][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:01:34,425][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:01:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:01:35,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:01:36,039][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:01:36,367][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:01:36,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:01:37,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:01:37,349][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:01:37,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:01:38,008][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:01:38,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:01:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:01:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:01:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:01:39,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:01:39,982][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:01:40,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:01:40,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:01:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:01:41,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:01:41,621][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:01:41,947][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:01:42,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:01:42,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:01:42,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:01:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:01:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:01:43,913][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:01:44,241][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:01:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:01:44,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:01:45,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:01:45,552][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:01:45,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:01:46,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:47,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:47,321][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:47,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:48,460][__main__][INFO] - Iteration 607 took 23s (37.19% Gen, 57.90% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 27m 32s. Estimated total time: 19h 22m 32s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 45s. [2025-11-13 12:01:48,462][__main__][INFO] - Starting iteration 607. [2025-11-13 12:01:48,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:01:48,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:01:56,466][__main__][INFO] - Number of regex retries in iteration 607: 0 [2025-11-13 12:01:56,467][__main__][INFO] - agents played in iteration 607 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:01:56,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:56,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:56,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:57,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:57,024][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:01:57,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:01:57,747][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:01:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:01:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:01:58,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:01:59,029][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:01:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:01:59,687][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:02:00,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:02:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:02:00,674][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:02:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:02:01,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:02:01,668][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:02:01,993][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:02:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:02:02,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:02:02,988][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:02:03,318][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:02:03,645][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:02:03,977][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:02:04,305][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:02:04,635][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:02:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:02:05,295][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:02:05,630][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:02:05,957][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:02:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:02:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:02:06,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:02:07,265][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:02:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:02:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:02:08,247][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:02:08,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:02:09,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:02:09,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:02:09,672][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:02:10,564][__main__][INFO] - Iteration 608 took 22s (36.20% Gen, 59.75% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 29m 38s. Estimated total time: 18h 24m 59s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 49s, 500 more iterations: 3h 4m 9s. [2025-11-13 12:02:10,567][__main__][INFO] - Starting iteration 608. [2025-11-13 12:02:10,569][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:02:10,570][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:02:19,278][__main__][INFO] - Number of regex retries in iteration 608: 0 [2025-11-13 12:02:19,278][__main__][INFO] - agents played in iteration 608 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:02:19,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:19,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:19,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:19,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:19,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:02:19,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:02:20,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:02:20,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:02:21,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:02:21,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:02:21,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:02:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:02:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:02:22,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:02:23,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:02:23,502][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:02:23,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:02:24,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:02:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:02:24,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:02:25,139][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:02:25,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:02:25,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:02:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:02:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:02:26,784][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:02:27,111][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:02:27,440][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:02:27,767][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:02:28,095][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:02:28,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:02:28,748][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:02:29,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:02:29,403][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:02:29,731][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:02:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:02:30,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:02:30,713][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:02:31,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:02:31,756][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:02:32,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:02:32,467][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:02:32,469][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:02:33,550][__main__][INFO] - Iteration 609 took 22s (37.89% Gen, 57.40% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 13m 20s. Estimated total time: 19h 9m 5s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 30s. [2025-11-13 12:02:33,553][__main__][INFO] - Starting iteration 609. [2025-11-13 12:02:33,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:02:33,556][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:02:42,763][__main__][INFO] - Number of regex retries in iteration 609: 0 [2025-11-13 12:02:42,764][__main__][INFO] - agents played in iteration 609 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:02:43,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:43,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:43,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:43,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:43,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:02:43,327][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:02:44,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:02:44,364][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:02:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:02:45,022][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:02:45,347][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:02:45,673][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:02:46,005][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:02:46,333][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:02:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:02:46,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:02:47,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:02:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:02:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:02:48,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:02:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:02:48,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:02:49,297][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:02:49,625][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:02:49,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:02:50,279][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:02:50,606][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:02:50,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:02:51,263][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:02:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:02:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:02:52,253][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:02:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:02:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:02:53,245][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:02:53,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:02:53,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:02:54,224][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:02:54,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:02:55,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:02:55,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:02:55,972][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:02:55,974][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:02:56,835][__main__][INFO] - Iteration 610 took 23s (39.55% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 53s. Estimated total time: 19h 24m 1s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 0s. [2025-11-13 12:02:56,837][__main__][INFO] - Starting iteration 610. [2025-11-13 12:02:56,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. [2025-11-13 12:02:56,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:03:06,112][__main__][INFO] - Number of regex retries in iteration 610: 0 [2025-11-13 12:03:06,113][__main__][INFO] - agents played in iteration 610 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:03:06,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:06,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:06,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:06,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:06,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:03:06,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:03:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:03:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:03:08,033][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:03:08,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:03:08,687][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:03:09,013][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:03:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:03:09,668][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:03:09,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:03:10,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:03:10,648][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:03:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:03:11,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:03:11,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:03:11,959][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:03:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:03:12,614][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:03:12,940][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:03:13,267][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:03:13,594][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:03:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:03:14,247][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:03:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:03:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:03:15,230][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:03:15,566][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:03:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:03:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:03:16,552][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:03:16,879][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:03:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:03:17,537][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:03:17,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:03:18,570][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:03:19,275][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:03:19,277][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:03:19,278][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:03:22,723][__main__][INFO] - Iteration 611 took 25s (35.82% Gen, 50.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 37m 38s. Estimated total time: 21h 34m 12s. Time estimates for 10 more iterations: 4m 18s, 100 more iterations: 43m 8s, 500 more iterations: 3h 35m 42s. [2025-11-13 12:03:22,725][__main__][INFO] - Starting iteration 611. [2025-11-13 12:03:22,729][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. [2025-11-13 12:03:22,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:03:32,106][__main__][INFO] - Number of regex retries in iteration 611: 0 [2025-11-13 12:03:32,106][__main__][INFO] - agents played in iteration 611 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:03:32,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:32,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:32,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:32,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:32,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:03:32,661][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:03:33,363][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:03:33,661][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:03:33,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:03:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:03:34,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:03:34,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:03:35,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:03:35,630][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:03:35,960][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:03:36,293][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:03:36,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:03:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:03:37,277][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:03:37,607][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:03:37,937][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:03:38,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:03:38,597][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:03:38,924][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:03:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:03:39,587][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:03:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:03:40,242][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:03:40,569][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:03:40,897][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:03:41,224][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:03:41,552][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:03:41,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:03:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:03:42,534][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:03:42,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:03:43,187][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:03:43,514][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:03:43,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:03:44,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:03:45,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:03:45,273][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:03:45,275][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:03:47,874][__main__][INFO] - Iteration 612 took 25s (37.29% Gen, 52.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 0m 18s. Estimated total time: 20h 57m 17s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 54s, 500 more iterations: 3h 29m 32s. [2025-11-13 12:03:47,876][__main__][INFO] - Starting iteration 612. [2025-11-13 12:03:47,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. [2025-11-13 12:03:47,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:03:56,725][__main__][INFO] - Number of regex retries in iteration 612: 0 [2025-11-13 12:03:56,725][__main__][INFO] - agents played in iteration 612 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:03:57,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:57,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:57,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:57,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:03:57,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:03:57,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:03:57,993][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:03:58,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:03:58,621][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:03:58,955][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:03:59,290][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:03:59,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:03:59,946][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:04:00,276][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:04:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:04:00,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:04:01,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:04:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:04:01,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:04:02,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:04:02,579][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:04:02,908][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:04:03,235][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:04:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:04:03,891][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:04:04,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:04:04,547][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:04:04,875][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:04:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:04:05,530][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:04:05,858][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:04:06,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:04:06,513][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:04:06,841][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:04:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:04:07,495][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:04:07,823][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:04:08,156][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:04:08,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:04:09,186][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:04:09,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:04:09,927][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:04:09,929][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:04:11,603][__main__][INFO] - Iteration 613 took 23s (37.28% Gen, 55.65% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 48m 52s. Estimated total time: 19h 46m 15s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 42s. [2025-11-13 12:04:11,605][__main__][INFO] - Starting iteration 613. [2025-11-13 12:04:11,609][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. [2025-11-13 12:04:11,609][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:04:20,747][__main__][INFO] - Number of regex retries in iteration 613: 0 [2025-11-13 12:04:20,748][__main__][INFO] - agents played in iteration 613 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:04:21,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:21,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:21,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:21,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:21,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:04:21,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:04:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:04:22,308][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:04:22,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:04:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:04:23,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:04:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:04:23,954][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:04:24,284][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:04:24,619][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:04:24,948][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:04:25,279][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:04:25,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:04:25,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:04:26,268][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:04:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:04:26,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:04:27,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:04:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:04:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:04:28,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:04:28,563][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:04:28,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:04:29,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:04:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:04:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:04:30,201][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:04:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:04:30,856][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:04:31,183][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:04:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:04:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:04:32,164][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:04:32,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:04:33,213][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:04:33,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:04:33,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:04:33,964][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:04:35,129][__main__][INFO] - Iteration 614 took 23s (38.85% Gen, 56.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 18s. Estimated total time: 19h 36m 4s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 0s. [2025-11-13 12:04:35,131][__main__][INFO] - Starting iteration 614. [2025-11-13 12:04:35,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. [2025-11-13 12:04:35,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:04:44,220][__main__][INFO] - Number of regex retries in iteration 614: 0 [2025-11-13 12:04:44,221][__main__][INFO] - agents played in iteration 614 are Bob_buffer, Alice, Bob, Alice_buffer [2025-11-13 12:04:44,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:44,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:44,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:44,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:44,770][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:04:44,770][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 12:04:45,493][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:04:45,793][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:04:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:04:46,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:04:46,785][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:04:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:04:47,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:04:47,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:04:48,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:04:48,433][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:04:48,763][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:04:49,093][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:04:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:04:49,756][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:04:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:04:50,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:04:50,743][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:04:51,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:04:51,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:04:51,730][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:04:52,058][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 12:04:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:04:52,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:04:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:04:53,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:04:53,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:04:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:04:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:04:54,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:04:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:04:55,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:04:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:04:55,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. [2025-11-13 12:04:56,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:04:57,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:04:57,449][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:04:57,450][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_bs128/seed_0/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:04:58,659][__main__][INFO] - Iteration 615 took 23s (38.62% Gen, 56.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 6s. Estimated total time: 19h 36m 15s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 2s. [2025-11-13 12:04:58,661][__main__][INFO] - Starting iteration 615. [2025-11-13 12:04:58,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. [2025-11-13 12:04:58,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:05:10,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,041][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,041][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,042][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,042][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,042][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,043][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,043][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,044][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,044][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,045][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,045][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,045][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,046][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,046][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,047][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,047][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,047][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,048][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,048][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,048][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,079][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,079][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,079][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,079][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,079][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,080][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,080][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,080][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,080][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,080][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,081][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,081][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,081][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,081][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,081][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,082][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,082][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,082][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,082][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,082][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,082][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,083][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,083][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,083][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,083][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,083][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,084][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,084][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,084][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,084][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,084][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,086][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,087][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,088][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,089][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,089][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,089][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,089][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,089][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,089][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,090][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,090][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,090][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,090][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,090][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,091][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,091][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,091][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,091][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,091][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,091][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,092][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,092][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,092][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,092][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,092][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,092][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,093][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,093][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,093][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,093][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,093][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,093][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,094][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,094][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,094][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,094][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,094][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,094][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,095][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,095][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,095][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,095][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,095][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,095][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,095][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,096][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,096][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,096][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,096][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,096][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,096][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,097][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,097][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,097][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,097][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,097][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,097][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,097][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,098][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,098][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,098][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,098][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,098][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,098][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,099][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,099][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,099][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,099][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,099][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,099][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,100][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,100][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,100][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,100][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,100][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,100][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,100][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,101][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,101][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,101][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,101][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,101][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,101][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,101][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,102][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,102][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,102][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,102][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,102][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,102][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,102][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,103][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,103][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,103][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,103][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,103][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,103][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,103][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,104][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,104][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,104][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,104][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,104][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,104][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,104][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,115][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,115][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,115][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,184][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,185][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,186][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,187][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,188][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,189][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,190][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,191][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,192][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,193][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,194][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,195][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,196][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,197][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,198][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,199][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,200][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,201][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,202][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,203][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,204][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,205][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,206][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,207][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,208][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,209][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,210][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,210][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,210][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,210][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,210][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,210][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,210][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,211][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,211][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,211][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,211][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,211][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,211][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,211][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,212][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,213][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,213][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,213][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,213][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,213][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,213][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,213][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,214][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,215][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,215][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,215][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,215][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,215][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,215][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,215][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,216][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,408][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,408][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,408][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,408][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,408][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,408][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,408][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,409][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,409][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,409][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,409][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,409][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,409][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,409][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,454][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,454][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,454][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,454][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,454][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,454][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,454][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,455][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,455][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,455][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,455][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,455][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,455][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,455][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,456][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,457][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,457][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,457][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,457][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,457][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,457][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,457][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,458][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,459][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,459][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,459][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,459][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,459][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,459][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,459][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,460][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,461][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,461][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,461][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,461][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,461][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,461][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,461][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,462][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,462][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,462][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,462][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,462][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,462][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,462][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,463][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,464][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,464][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,464][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,464][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,464][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,464][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,464][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,465][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,466][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,466][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,466][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,466][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,466][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,466][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,466][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,467][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,467][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,467][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,467][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,467][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,467][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,467][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,468][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,469][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,469][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,469][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,469][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,469][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,469][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,469][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,470][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,471][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,471][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,471][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,471][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,471][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,471][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,471][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,472][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,473][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,473][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,473][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,473][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,473][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,473][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,473][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,474][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,474][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,474][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,474][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,474][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,474][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,474][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,475][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,476][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,476][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,476][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,476][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,476][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,476][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,476][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,477][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,478][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,478][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,478][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,478][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,478][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,478][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,478][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,479][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,479][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,479][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,479][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,479][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,479][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,479][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,480][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,481][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,481][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,481][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,481][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,481][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,481][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,481][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,482][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,482][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,482][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,482][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,482][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,482][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,482][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,483][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,484][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,484][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,484][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,484][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,484][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,484][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,484][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,485][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,486][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,486][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,486][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,486][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,486][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,486][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,486][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,487][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,488][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,488][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,488][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,488][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,488][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,488][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,488][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,489][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,489][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,489][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,489][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,489][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,489][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,489][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,490][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,491][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,491][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,491][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,491][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,491][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,491][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,491][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,492][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,493][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,493][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,493][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,493][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,493][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,493][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,493][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,494][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,495][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,495][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,495][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,495][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,495][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,495][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,495][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,496][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,496][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,496][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,496][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,496][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,496][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,496][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,497][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,498][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,498][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,498][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,498][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,498][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,498][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,498][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,499][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,500][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,501][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,501][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,501][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,501][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,501][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,501][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,501][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,502][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,502][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,502][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,502][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,502][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,502][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,502][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,503][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,504][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,504][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,504][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,504][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,504][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,504][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,504][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,505][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,506][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,506][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,506][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,506][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,506][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,506][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,506][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,507][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,508][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,508][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,508][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,508][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,508][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,508][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,508][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,509][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,510][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,510][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,510][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,510][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,510][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,510][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,510][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,511][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,512][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,512][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,512][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,512][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,512][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,512][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,512][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,513][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,514][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,514][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,514][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,514][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,514][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,514][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,514][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,515][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,516][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,516][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,516][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,516][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,516][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,516][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,516][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,517][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,518][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,518][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,518][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,518][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,518][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,518][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,518][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,519][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,520][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,520][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,520][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,520][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,520][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,520][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,520][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,521][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,522][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,522][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,522][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,522][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,522][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,522][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,522][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,523][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,524][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,524][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,524][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,524][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,524][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,524][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,524][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,525][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,526][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,526][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,526][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,526][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,526][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,526][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,526][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,527][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,528][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,529][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,529][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,529][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,529][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,529][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,529][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,529][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,530][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,530][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,530][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,530][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,530][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,530][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,530][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,531][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,532][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,532][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,532][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,532][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,532][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,532][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,532][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,533][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,534][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,534][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,534][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,534][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,534][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,534][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,534][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,535][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,535][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,535][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,535][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,535][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,535][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,535][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,536][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,537][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,537][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,537][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,537][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,537][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,537][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,537][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,538][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,539][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,539][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,539][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,539][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,539][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,539][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,539][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,540][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,541][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,541][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,541][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,541][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,541][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,541][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,541][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,542][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,543][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,543][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,543][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,543][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,543][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,543][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,543][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,544][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,545][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,545][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,545][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,545][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,545][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,545][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,545][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,546][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,546][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,546][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,546][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,546][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,546][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,546][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,547][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,548][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,548][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,548][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,548][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,548][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,548][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,548][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,549][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,549][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,549][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,549][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,549][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,549][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,549][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,550][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,551][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,551][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,551][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,551][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,551][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,551][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,551][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,552][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,553][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,553][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,553][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,553][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,553][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,553][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,553][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,554][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,555][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,556][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,557][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,558][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,559][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,560][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,561][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,562][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,563][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,564][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,565][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,566][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,567][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,568][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,569][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,570][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,571][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,572][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,573][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,574][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,575][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,576][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,874][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,875][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,875][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,875][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,875][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,875][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,875][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,875][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,943][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,944][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,945][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,946][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,947][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,948][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,949][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,949][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,949][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,949][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,949][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,949][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,949][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,950][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,951][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,951][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,951][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,951][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,951][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,951][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,951][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,952][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,953][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,953][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,953][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,953][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,953][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,953][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,953][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,954][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,954][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,954][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,954][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,954][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,954][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,954][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,955][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,956][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,956][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,956][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,956][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,956][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,956][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,956][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,957][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,958][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,958][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,958][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,958][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,958][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,958][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,958][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,959][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,960][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,960][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,960][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,960][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,960][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,960][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,960][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,961][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,962][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,962][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,962][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,962][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,962][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,962][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,962][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,963][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,964][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,964][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,964][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,964][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,964][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,964][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,964][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,965][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,966][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,966][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,966][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,966][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,966][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,966][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,966][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,967][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,968][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,968][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,968][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,968][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,968][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,968][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,968][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,969][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,970][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,970][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,970][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,970][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,970][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,970][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,970][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,971][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,971][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,971][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,971][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,971][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,971][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,971][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,972][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,973][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,973][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,973][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,973][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,973][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,973][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,973][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,974][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,975][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,975][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,975][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,975][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,975][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,975][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,975][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,976][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,977][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,977][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,977][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,977][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,977][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,977][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,977][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,978][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,979][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,979][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,979][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,979][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,979][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,979][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,979][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,980][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,980][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,980][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,980][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,980][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,980][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,980][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,981][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,982][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,983][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,984][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:10,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,019][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,019][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,019][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,019][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,019][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,019][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,019][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,020][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,020][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,020][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,020][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,020][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,020][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,020][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,021][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,025][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,026][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,026][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,026][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,026][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,026][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,026][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,026][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,027][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,027][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,027][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,027][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,027][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,027][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,027][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,028][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,029][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,030][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,031][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,031][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,031][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,031][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,031][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,031][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,031][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,032][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,033][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,034][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,035][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,035][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,035][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,035][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,035][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,035][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,035][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,036][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,037][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,038][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,039][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,040][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,041][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,041][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,041][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,085][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,085][asyncio][WARNING] - socket.send() raised exception.