[2025-11-13 08:04:09,155][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:10,186][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:04:10,193][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:11,016][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:06:20,148][__main__][INFO] - Starting iteration 0.
[2025-11-13 08:06:20,160][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:20,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:25,083][__main__][INFO] - Number of regex retries in iteration 0: 0
[2025-11-13 08:06:25,083][__main__][INFO] - agents played in iteration 0 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:06:25,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:25,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:06:26,249][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:26,880][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:27,207][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:27,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:28,188][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:28,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:29,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:29,499][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:30,146][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:30,794][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:32,413][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:33,062][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:33,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:33,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:34,692][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:35,015][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:35,665][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:35,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:36,638][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:36,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
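The mini-batch lines above trace a token-weighted gradient-accumulation loop: 128 mini-batches are processed, progress is logged every 4th batch, and the summed policy-gradient loss is reported against the total token count. A minimal sketch of that pattern, assuming a hypothetical `accumulate_policy_loss` helper (names are illustrative, not taken from `mllm`):

```python
def accumulate_policy_loss(minibatches, loss_fn, log=print, log_every=4):
    """Accumulate a token-weighted policy-gradient loss over mini-batches.

    `loss_fn(batch)` is assumed to return (per-token loss, token count)
    for one mini-batch; the real trainer would also call backward() here.
    """
    total_loss, total_tokens = 0.0, 0
    n = len(minibatches)
    for i, batch in enumerate(minibatches):
        if i % log_every == 0:
            log(f"Processing mini-batch {i} of {n}")
        loss, tokens = loss_fn(batch)
        total_loss += loss * tokens   # weight each batch by its token count
        total_tokens += tokens
    log(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss / max(total_tokens, 1)
```

With 128 mini-batches of 30 response tokens each this yields the "3840 tokens" figure seen in the log.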
[2025-11-13 08:06:37,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.58%, Current % of VRAM taken: 42.03%, Block Peak % of device VRAM: 25.21%, ΔTime: 00:00:11
[2025-11-13 08:06:38,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:38,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:38,358][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:39,496][__main__][INFO] - Iteration 1 took 19s (25.46% Gen, 68.65% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 4m 1s. Estimated total time: 16h 6m 51s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 13s, 500 more iterations: 2h 41m 8s.
[2025-11-13 08:06:39,499][__main__][INFO] - Starting iteration 1.
[2025-11-13 08:06:39,503][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
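The per-iteration summary ("Iteration 1 took 19s ... Estimated remaining time ...") is consistent with a simple extrapolation: the average iteration time so far, multiplied by the remaining iteration count, formatted as h/m/s. A sketch under that assumption (helper names hypothetical):

```python
def fmt_hms(seconds):
    """Format a duration like the log does: '2h 41m 8s', '3m 13s', '45s'."""
    s = int(seconds)
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if h or m:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

def estimate_remaining(elapsed_total, iters_done, iters_total):
    """Extrapolate remaining wall time from the mean iteration time so far."""
    avg = elapsed_total / iters_done
    return fmt_hms(avg * (iters_total - iters_done))
```

For example, at 19 s per iteration, 10 more iterations extrapolate to "3m 10s", close to the "3m 13s" the trainer reports from its own (presumably fractional-second) bookkeeping.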
[2025-11-13 08:06:39,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:43,354][__main__][INFO] - Number of regex retries in iteration 1: 0
[2025-11-13 08:06:43,355][__main__][INFO] - agents played in iteration 1 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:06:43,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:43,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
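The recurring "For task: ..., ΔVRAM % (total): ..., ΔTime: ..." entries look like a resource-tracking block wrapped around each task. A dependency-free sketch of such a tracker as a context manager; `mem_used`/`mem_peak` are injectable probes (on GPU these would plausibly be `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated`, though that wiring is an assumption, not taken from `mllm`):

```python
import time
from contextlib import contextmanager

@contextmanager
def track(task, total_mem, mem_used=lambda: 0, mem_peak=lambda: 0, log=print):
    """Log memory delta, current usage, block peak, and elapsed time for a task.

    All memory figures are reported as percentages of `total_mem` (device
    capacity), matching the format of the trainer's log lines.
    """
    start_mem, start = mem_used(), time.monotonic()
    yield
    elapsed = time.gmtime(time.monotonic() - start)
    log(
        f"For task: {task}, "
        f"ΔVRAM % (total): {100 * (mem_used() - start_mem) / total_mem:.2f}%, "
        f"Current % of VRAM taken: {100 * mem_used() / total_mem:.2f}%, "
        f"Block Peak % of device VRAM: {100 * mem_peak() / total_mem:.2f}%, "
        f"ΔTime: {time.strftime('%H:%M:%S', elapsed)}"
    )
```

A ΔVRAM of 0.00% with a nonzero "Current" figure, as in the lines above, simply means the block allocated and freed memory without changing steady-state usage.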
[2025-11-13 08:06:44,591][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:44,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:45,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:45,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:45,870][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:46,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:46,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:47,816][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:48,140][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:48,464][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:48,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:49,438][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:49,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:50,090][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:50,414][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:50,739][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:51,063][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:51,387][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:52,050][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:52,702][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:53,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:53,683][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:54,011][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:54,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:54,662][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:54,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:06:55,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:06:56,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:56,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:56,423][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:57,525][__main__][INFO] - Iteration 2 took 18s (21.37% Gen, 72.51% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 1s. Estimated total time: 15h 1m 9s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 2s, 500 more iterations: 2h 30m 11s.
[2025-11-13 08:06:57,527][__main__][INFO] - Starting iteration 2.
[2025-11-13 08:06:57,530][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:57,531][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:01,181][__main__][INFO] - Number of regex retries in iteration 2: 0
[2025-11-13 08:07:01,182][__main__][INFO] - agents played in iteration 2 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:07:01,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,745][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:01,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:02,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:03,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:03,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:03,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:04,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:04,693][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:05,021][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:05,348][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:05,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:06,326][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:07,308][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:07,632][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:08,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:08,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:09,272][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:09,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:10,248][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:10,903][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:11,228][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:11,880][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:12,531][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:12,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:13,531][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:14,284][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:14,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:14,288][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:15,266][__main__][INFO] - Iteration 3 took 17s (20.59% Gen, 73.89% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 43m 24s. Estimated total time: 14h 46m 50s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 33s, 500 more iterations: 2h 27m 48s.
[2025-11-13 08:07:15,268][__main__][INFO] - Starting iteration 3.
[2025-11-13 08:07:15,272][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:15,273][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:18,922][__main__][INFO] - Number of regex retries in iteration 3: 0
[2025-11-13 08:07:18,923][__main__][INFO] - agents played in iteration 3 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:07:19,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:19,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:20,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:20,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:21,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:21,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:21,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:22,107][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:22,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:23,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:24,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:24,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:24,723][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:25,372][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:25,699][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:26,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:26,673][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:26,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:27,322][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:27,646][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:29,596][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:29,927][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:30,253][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:30,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:31,287][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:32,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:32,041][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:32,043][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:33,054][__main__][INFO] - Iteration 4 took 17s (20.52% Gen, 73.78% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 24s. Estimated total time: 14h 49m 8s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 38s, 500 more iterations: 2h 28m 11s.
[2025-11-13 08:07:33,056][__main__][INFO] - Starting iteration 4.
[2025-11-13 08:07:33,059][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:33,059][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:36,721][__main__][INFO] - Number of regex retries in iteration 4: 0
[2025-11-13 08:07:36,721][__main__][INFO] - agents played in iteration 4 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:07:37,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,263][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:37,263][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:38,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:38,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:39,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:40,538][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:41,194][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:41,519][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:42,169][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:42,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:42,819][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:44,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:44,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:44,767][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:45,092][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:45,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:45,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:46,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:46,394][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:46,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:47,045][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:47,370][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:47,694][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:48,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:49,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:49,752][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:49,754][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:49,755][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:50,741][__main__][INFO] - Iteration 5 took 17s (20.71% Gen, 73.71% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 40m 7s. Estimated total time: 14h 44m 9s. Time estimates for 10 more iterations: 2m 56s, 100 more iterations: 29m 28s, 500 more iterations: 2h 27m 21s.
[2025-11-13 08:07:50,743][__main__][INFO] - Starting iteration 5.
[2025-11-13 08:07:50,747][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:50,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:54,473][__main__][INFO] - Number of regex retries in iteration 5: 0
[2025-11-13 08:07:54,474][__main__][INFO] - agents played in iteration 5 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:07:54,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:55,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:55,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:07:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:07:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:07:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:07:56,999][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:07:57,325][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:07:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:07:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:07:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:07:58,629][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:07:58,955][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:07:59,280][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:07:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:07:59,930][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:00,255][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:00,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:00,905][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:01,230][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:08:01,555][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:08:01,879][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:08:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:08:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:08:02,856][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:08:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:08:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:08:03,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:08:04,155][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:08:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:08:04,806][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:08:05,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:08:05,457][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:08:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:08:06,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:06,790][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:08:07,523][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:08:07,524][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:08:07,526][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:08:08,515][__main__][INFO] - Iteration 6 took 17s (20.97% Gen, 73.46% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 44m 8s. Estimated total time: 14h 48m 28s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 36s, 500 more iterations: 2h 28m 4s. [2025-11-13 08:08:08,518][__main__][INFO] - Starting iteration 6. [2025-11-13 08:08:08,521][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-13 08:08:08,521][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:08:12,276][__main__][INFO] - Number of regex retries in iteration 6: 0 [2025-11-13 08:08:12,277][__main__][INFO] - agents played in iteration 6 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:08:12,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:12,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:12,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:12,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:12,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:08:12,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:08:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:08:13,830][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:08:14,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:08:14,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:08:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:08:15,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:08:15,461][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:08:15,788][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:08:16,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:08:16,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:08:16,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:08:17,093][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:08:17,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:08:17,746][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:18,071][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:18,397][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:18,722][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:19,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:08:19,372][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:08:19,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:08:20,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:08:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:08:20,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:08:21,004][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:08:21,329][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:08:21,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:08:21,981][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:08:22,307][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:08:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:08:22,958][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:08:23,284][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:08:23,610][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:08:23,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:24,645][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:08:25,405][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:08:25,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:08:25,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:08:26,418][__main__][INFO] - Iteration 7 took 17s (20.98% Gen, 73.37% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 50m 17s. Estimated total time: 14h 54m 54s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 49s, 500 more iterations: 2h 29m 9s. [2025-11-13 08:08:26,420][__main__][INFO] - Starting iteration 7. [2025-11-13 08:08:26,423][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-13 08:08:26,424][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:08:30,053][__main__][INFO] - Number of regex retries in iteration 7: 0 [2025-11-13 08:08:30,054][__main__][INFO] - agents played in iteration 7 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:08:30,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:30,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:30,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:30,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:30,612][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:08:30,612][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:08:31,315][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:08:31,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:08:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:08:32,265][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:08:32,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:08:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:08:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:08:33,573][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:08:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:08:34,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:08:34,553][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:08:34,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:08:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:08:35,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:35,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:36,512][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:08:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:08:37,500][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:08:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:08:38,158][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:08:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:08:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:08:39,138][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:08:39,465][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:08:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:08:40,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:08:40,453][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:08:40,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:08:41,117][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:08:41,447][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:08:41,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:42,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:08:43,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:08:43,195][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:08:43,196][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:08:44,238][__main__][INFO] - Iteration 8 took 17s (20.38% Gen, 73.77% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 51s. Estimated total time: 14h 50m 46s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 41s, 500 more iterations: 2h 28m 27s. [2025-11-13 08:08:44,240][__main__][INFO] - Starting iteration 8. [2025-11-13 08:08:44,244][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-13 08:08:44,244][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:08:47,898][__main__][INFO] - Number of regex retries in iteration 8: 0 [2025-11-13 08:08:47,899][__main__][INFO] - agents played in iteration 8 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:08:48,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:48,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:48,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:48,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:48,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:08:48,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:08:49,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:08:49,459][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:08:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:08:50,110][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:08:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:08:50,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:08:51,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:08:51,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:08:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:08:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:08:52,396][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:08:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:08:53,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:08:53,374][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:08:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:08:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:08:54,354][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:08:54,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:08:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:08:55,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:08:55,658][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:08:55,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:08:56,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:08:56,635][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:08:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:08:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:08:57,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:08:57,939][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:08:58,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:08:58,590][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:08:58,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:08:59,242][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:08:59,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:00,273][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:09:01,026][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:09:01,027][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:09:01,029][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:09:02,065][__main__][INFO] - Iteration 9 took 17s (20.51% Gen, 73.68% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 52s. Estimated total time: 14h 51m 5s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 42s, 500 more iterations: 2h 28m 30s. [2025-11-13 08:09:02,067][__main__][INFO] - Starting iteration 9. [2025-11-13 08:09:02,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-13 08:09:02,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:09:05,772][__main__][INFO] - Number of regex retries in iteration 9: 0 [2025-11-13 08:09:05,772][__main__][INFO] - agents played in iteration 9 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:09:06,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:06,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:06,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:06,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:06,322][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:09:06,323][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:09:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:09:07,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:09:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:09:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:09:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:09:08,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:09:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:09:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:09:09,639][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:09:09,965][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:09:10,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:09:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:09:10,943][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:09:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:09:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:09:11,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:09:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:09:12,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:09:12,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:09:13,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:09:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:09:13,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:09:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:09:14,540][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:09:14,865][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:09:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:09:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:09:15,852][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:09:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:09:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:09:16,828][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:09:17,157][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:09:17,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:18,181][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:09:18,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:09:18,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:09:18,923][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:09:19,937][__main__][INFO] - Iteration 10 took 17s (20.71% Gen, 73.60% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 47m 51s. Estimated total time: 14h 53m 22s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 46s, 500 more iterations: 2h 28m 53s. [2025-11-13 08:09:19,940][__main__][INFO] - Starting iteration 10. [2025-11-13 08:09:19,943][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-13 08:09:19,943][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:09:23,604][__main__][INFO] - Number of regex retries in iteration 10: 0 [2025-11-13 08:09:23,605][__main__][INFO] - agents played in iteration 10 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:09:24,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:24,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:24,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:24,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:24,157][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:09:24,158][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:09:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:25,181][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:25,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:25,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:26,163][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:26,827][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:27,156][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:27,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:28,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:28,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:28,808][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:29,135][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:29,461][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:31,105][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:31,435][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:31,764][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:32,090][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:32,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:33,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:33,726][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:34,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:34,389][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:34,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:35,043][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:35,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
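The pattern above (progress logged every 4th mini-batch out of 128, then a single "Accumulated the policy gradient loss for 3840 tokens" message, i.e. 30 action tokens per mini-batch on average) is the classic gradient-accumulation loop before one optimizer step. A sketch under assumed interfaces (`minibatches`, `backward`, and `log` are hypothetical stand-ins; the real `trainer_common` API is not shown in the log):

```python
def accumulate_policy_gradient(minibatches, backward, log, log_every=4):
    """Accumulate the policy-gradient loss over mini-batches before one step.

    `minibatches` yields (loss, num_action_tokens) pairs; `backward` is the
    framework's backward call (e.g. loss.backward() in PyTorch).
    """
    total_tokens = 0
    n = len(minibatches)
    for i, (loss, num_tokens) in enumerate(minibatches):
        if i % log_every == 0:
            log(f"Processing mini-batch {i} of {n}")
        backward(loss / n)  # scale so the accumulated sum matches a full-batch step
        total_tokens += num_tokens
    log(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_tokens
```

Scaling each loss by `1/n` before `backward` is one common convention; averaging by token count instead would be equally plausible here.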
[2025-11-13 08:09:36,234][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:37,005][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:37,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:37,008][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:39,022][__main__][INFO] - Iteration 11 took 19s (19.19% Gen, 70.24% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 48m 9s. Estimated total time: 15h 53m 58s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 47s, 500 more iterations: 2h 38m 59s.
[2025-11-13 08:09:39,024][__main__][INFO] - Starting iteration 11.
[2025-11-13 08:09:39,027][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
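The per-iteration timing line ("Estimated remaining time … Time estimates for 10/100/500 more iterations") is a straightforward projection from the mean time per iteration: at ~19 s per iteration, the quoted ~15h 48m remaining implies a run of roughly 3,000 iterations. A sketch of that arithmetic (function names are hypothetical):

```python
def eta_seconds(done_iters: int, total_iters: int, avg_iter_s: float) -> float:
    """Project remaining wall-clock time from the mean time per iteration."""
    return (total_iters - done_iters) * avg_iter_s


def fmt_hms(seconds: float) -> str:
    """Format seconds as the log's 'Xh Ym Zs' style."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"
```
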
[2025-11-13 08:09:39,028][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:43,211][__main__][INFO] - Number of regex retries in iteration 11: 0
[2025-11-13 08:09:43,212][__main__][INFO] - agents played in iteration 11 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:09:43,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,764][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:43,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
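The "Sharing advantage alignment data" / "Receiving advantage packets" pair suggests each agent's trainer publishes its own per-timestep advantages and then blocks until the opponent's arrive, so both sides can form the advantage-alignment objective. A minimal sketch of that exchange with in-process queues (the queue-based transport and function name are hypothetical; the log does not show the real mechanism):

```python
from queue import Queue


def share_advantages(own_advantages, send_q: Queue, recv_q: Queue):
    """Publish our advantage packet, then block until the opponent's arrives."""
    send_q.put(list(own_advantages))  # "Sharing advantage alignment data."
    return recv_q.get()               # "Receiving advantage packets."
```

In a multi-process setup the same handshake would more likely go through a torch.distributed collective or a socket rather than `queue.Queue`.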
[2025-11-13 08:09:44,469][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:45,092][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:45,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:45,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:46,069][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:46,395][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:46,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:47,050][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:47,705][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:48,368][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:48,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:49,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:49,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:49,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:50,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:50,326][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:50,652][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:50,978][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:51,306][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:51,638][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:51,965][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:52,617][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:53,268][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:53,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:54,247][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:54,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:55,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:56,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:56,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:56,332][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:57,410][__main__][INFO] - Iteration 12 took 18s (22.76% Gen, 71.37% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 13m 3s. Estimated total time: 15h 19m 11s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 38s, 500 more iterations: 2h 33m 11s.
[2025-11-13 08:09:57,412][__main__][INFO] - Starting iteration 12.
[2025-11-13 08:09:57,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:09:57,416][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:01,413][__main__][INFO] - Number of regex retries in iteration 12: 0
[2025-11-13 08:10:01,413][__main__][INFO] - agents played in iteration 12 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:10:01,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,969][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:01,970][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:02,676][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:02,973][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:03,299][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:03,954][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:04,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:04,611][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:04,945][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:05,272][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:05,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:05,925][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:06,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:06,580][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:07,234][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:07,567][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:07,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:08,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:08,873][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:09,199][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:09,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:10,505][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:10,833][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:11,160][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:11,812][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:12,792][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:13,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:13,811][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:14,538][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:14,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:14,541][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:15,547][__main__][INFO] - Iteration 13 took 18s (22.05% Gen, 72.40% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 0m 11s. Estimated total time: 15h 6m 37s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 13s, 500 more iterations: 2h 31m 6s.
[2025-11-13 08:10:15,549][__main__][INFO] - Starting iteration 13.
[2025-11-13 08:10:15,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:15,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:19,630][__main__][INFO] - Number of regex retries in iteration 13: 0
[2025-11-13 08:10:19,631][__main__][INFO] - agents played in iteration 13 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:10:20,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:20,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:20,908][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:21,207][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:21,535][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:21,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:22,193][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:22,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:23,186][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:23,843][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:24,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:24,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:25,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:25,504][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:25,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:26,162][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:26,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:27,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:27,816][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:28,151][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:29,145][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:29,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:30,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:30,466][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:31,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:32,175][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:32,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:32,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:32,915][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:34,059][__main__][INFO] - Iteration 14 took 18s (22.03% Gen, 71.77% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 18m 36s. Estimated total time: 15h 25m 21s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 50s, 500 more iterations: 2h 34m 13s.
[2025-11-13 08:10:34,061][__main__][INFO] - Starting iteration 14.
[2025-11-13 08:10:34,064][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:34,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:37,948][__main__][INFO] - Number of regex retries in iteration 14: 0
[2025-11-13 08:10:37,949][__main__][INFO] - agents played in iteration 14 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:10:38,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,508][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:38,509][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:39,528][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:40,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:40,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:40,839][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:41,165][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:41,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:41,821][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:42,146][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:42,802][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:43,131][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:43,786][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:44,113][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:44,768][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:45,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:45,422][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:45,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:46,075][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:46,401][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:46,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:47,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:47,710][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:48,036][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:48,362][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:48,689][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:49,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:50,377][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:51,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:51,124][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:51,126][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:52,166][__main__][INFO] - Iteration 15 took 18s (21.46% Gen, 72.79% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 7s. Estimated total time: 15h 5m 10s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 10s, 500 more iterations: 2h 30m 51s.
[2025-11-13 08:10:52,168][__main__][INFO] - Starting iteration 15.
[2025-11-13 08:10:52,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:52,172][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:56,127][__main__][INFO] - Number of regex retries in iteration 15: 0
[2025-11-13 08:10:56,128][__main__][INFO] - agents played in iteration 15 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:10:56,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,684][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:56,684][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:57,700][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:58,353][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:58,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:59,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:59,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:00,307][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:00,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:01,284][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:01,610][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:01,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:02,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:02,912][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:03,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:03,567][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:04,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:04,545][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:04,871][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:05,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:05,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:06,506][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:07,160][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:07,486][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:07,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:08,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:09,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:09,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:09,225][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:10,228][__main__][INFO] - Iteration 16 took 18s (21.91% Gen, 72.53% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 55m 31s. Estimated total time: 15h 2m 52s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 5s, 500 more iterations: 2h 30m 28s.
[2025-11-13 08:11:10,230][__main__][INFO] - Starting iteration 16.
[2025-11-13 08:11:10,233][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:10,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:14,195][__main__][INFO] - Number of regex retries in iteration 16: 0 [2025-11-13 08:11:14,196][__main__][INFO] - agents played in iteration 16 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:11:14,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:14,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:11:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:11:15,763][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:11:16,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:11:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:11:16,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:11:17,073][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:11:17,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:11:17,725][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:11:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:11:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:11:18,703][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:11:19,031][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:11:19,358][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:11:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:11:20,009][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:11:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:11:20,668][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:11:20,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:11:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:11:21,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:11:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:22,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:11:22,629][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:11:22,955][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:11:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:11:23,607][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:11:23,932][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:11:24,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:11:24,583][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:11:24,908][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:11:25,234][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:11:25,560][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:11:25,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:26,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:11:27,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:11:27,306][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:11:27,307][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:11:28,379][__main__][INFO] - Iteration 17 took 18s (21.83% Gen, 72.25% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 59m 40s. Estimated total time: 15h 7m 19s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 14s, 500 more iterations: 2h 31m 13s. [2025-11-13 08:11:28,381][__main__][INFO] - Starting iteration 17. [2025-11-13 08:11:28,384][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-13 08:11:28,385][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:32,557][__main__][INFO] - Number of regex retries in iteration 17: 0 [2025-11-13 08:11:32,557][__main__][INFO] - agents played in iteration 17 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:11:32,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:33,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:33,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:33,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:33,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:33,109][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:11:33,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:11:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:11:34,452][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:11:34,785][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:11:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:11:35,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:11:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:11:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:11:36,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:11:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:11:37,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:11:37,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:11:37,753][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:11:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:11:38,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:11:38,729][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:11:39,055][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:11:39,381][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:11:39,708][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:11:40,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:11:40,359][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:40,688][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:11:41,015][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:11:41,344][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:11:41,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:11:41,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:11:42,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:11:42,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:11:42,977][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:11:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:11:43,629][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:11:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:11:44,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:44,971][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:11:45,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:11:45,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:11:45,705][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:11:46,670][__main__][INFO] - Iteration 18 took 18s (22.82% Gen, 71.90% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 6m 23s. Estimated total time: 15h 14m 21s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 28s, 500 more iterations: 2h 32m 23s. [2025-11-13 08:11:46,672][__main__][INFO] - Starting iteration 18. [2025-11-13 08:11:46,675][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-13 08:11:46,676][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:50,677][__main__][INFO] - Number of regex retries in iteration 18: 0 [2025-11-13 08:11:50,678][__main__][INFO] - agents played in iteration 18 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:11:51,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:51,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:51,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:51,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:51,236][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:51,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:11:51,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:11:52,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:11:52,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:11:52,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:11:53,244][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:11:53,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:11:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:11:54,224][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:11:54,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:11:54,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:11:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:11:55,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:11:55,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:11:56,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:11:56,498][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:11:56,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:11:57,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:11:57,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:11:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:11:58,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:11:58,451][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:58,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:11:59,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:11:59,428][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:11:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:12:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:12:00,406][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:12:00,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:12:01,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:12:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:12:01,711][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:12:02,037][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:12:02,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:03,051][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:12:03,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:12:03,787][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:12:03,788][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:12:04,781][__main__][INFO] - Iteration 19 took 18s (22.10% Gen, 72.41% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 57m 3s. Estimated total time: 15h 5m 19s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 10s, 500 more iterations: 2h 30m 53s. [2025-11-13 08:12:04,783][__main__][INFO] - Starting iteration 19. [2025-11-13 08:12:04,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-13 08:12:04,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:12:08,855][__main__][INFO] - Number of regex retries in iteration 19: 0 [2025-11-13 08:12:08,856][__main__][INFO] - agents played in iteration 19 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:12:09,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:09,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:09,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:09,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:09,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:12:09,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:12:10,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:12:10,428][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:12:10,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:12:11,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:12:11,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:12:11,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:12:12,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:12:12,395][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:12:12,727][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:12:13,052][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:12:13,380][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:12:13,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:12:14,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:12:14,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:12:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:12:15,010][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:12:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:12:15,663][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:12:15,991][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:12:16,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:12:16,649][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:12:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:12:17,626][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:12:17,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:12:18,277][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:12:18,602][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:12:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:12:19,259][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:12:19,587][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:12:19,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:12:20,241][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:12:20,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:21,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:12:21,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:12:21,993][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:12:21,994][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:12:22,999][__main__][INFO] - Iteration 20 took 18s (22.34% Gen, 72.14% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 2m 8s. Estimated total time: 15h 10m 42s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 21s, 500 more iterations: 2h 31m 47s. [2025-11-13 08:12:23,001][__main__][INFO] - Starting iteration 20. [2025-11-13 08:12:23,005][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-13 08:12:23,005][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:12:27,028][__main__][INFO] - Number of regex retries in iteration 20: 0 [2025-11-13 08:12:27,029][__main__][INFO] - agents played in iteration 20 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:12:27,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:27,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:27,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:27,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:27,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:12:27,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:12:28,322][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:12:28,619][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:12:28,947][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:12:29,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:12:29,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:12:29,932][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:12:30,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:12:30,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:12:30,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:12:31,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:12:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:12:31,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:12:32,229][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:12:32,554][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:12:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:12:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:12:33,535][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:12:33,861][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:12:34,190][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:12:34,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:12:34,849][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:12:35,500][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:12:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:12:36,151][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:12:36,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:12:36,804][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:12:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:12:37,460][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:12:37,792][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:12:38,120][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:12:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:12:38,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:39,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:12:40,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:12:40,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:12:40,226][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:12:42,211][__main__][INFO] - Iteration 21 took 19s (20.95% Gen, 68.71% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 51m 27s. Estimated total time: 16h 0m 20s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 0s, 500 more iterations: 2h 40m 3s. [2025-11-13 08:12:42,213][__main__][INFO] - Starting iteration 21. [2025-11-13 08:12:42,216][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2025-11-13 08:12:42,217][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:12:46,373][__main__][INFO] - Number of regex retries in iteration 21: 0 [2025-11-13 08:12:46,374][__main__][INFO] - agents played in iteration 21 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:12:46,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:46,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:46,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:46,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:46,930][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:12:46,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:12:47,652][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:47,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:48,277][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:48,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:48,930][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:49,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:49,912][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:51,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:51,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:51,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:52,529][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:52,855][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:53,508][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:53,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:54,158][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:54,483][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:54,808][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:55,134][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:55,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:56,109][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:58,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
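The run of "Processing mini-batch k of 128" entries ending in "Accumulated the policy gradient loss for 3840 tokens" is consistent with gradient accumulation: per-mini-batch losses are summed and normalized by token count before a single optimizer step ("Apply reinforce step"). A minimal stand-in sketch of just the accumulation arithmetic (hypothetical function and data layout; no autograd involved):

```python
def accumulate_policy_loss(minibatches):
    """Accumulate a token-weighted policy-gradient loss over mini-batches.

    Each mini-batch is a hypothetical pair:
    (sum over its tokens of -log_prob * advantage, token count).
    Normalizing by total tokens keeps the effective step size independent
    of how the batch is split into mini-batches.
    """
    total_loss, total_tokens = 0.0, 0
    for loss_sum, n_tokens in minibatches:
        total_loss += loss_sum
        total_tokens += n_tokens
    return total_loss / total_tokens, total_tokens

# 128 mini-batches of 30 tokens each -> 3840 tokens, matching the log.
batches = [(30.0 * (i % 3 + 1), 30) for i in range(128)]
mean_loss, n = accumulate_policy_loss(batches)
```

In a real PyTorch trainer the per-mini-batch loss would be scaled and `backward()` called inside the loop so gradients accumulate in place, with `optimizer.step()` invoked once at the end; the sketch above only mirrors the bookkeeping that produces the "3840 tokens" figure.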
[2025-11-13 08:12:58,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:59,499][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:59,500][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:59,502][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:00,498][__main__][INFO] - Iteration 22 took 18s (22.74% Gen, 71.81% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 4m 56s. Estimated total time: 15h 14m 8s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 28s, 500 more iterations: 2h 32m 21s.
[2025-11-13 08:13:00,500][__main__][INFO] - Starting iteration 22.
[2025-11-13 08:13:00,503][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
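The "Estimated remaining time" figures in the iteration summaries look like straight-line projections from the mean per-iteration wall time. A small sketch of that arithmetic (the total iteration count of 3000 and the helper names are assumptions, not taken from the log):

```python
def eta_estimate(iter_seconds, done, total):
    """Project remaining and total wall time from per-iteration time.

    Hypothetical straight-line projection: every future iteration is
    assumed to take the same time as the observed average.
    """
    remaining = (total - done) * iter_seconds
    return remaining, done * iter_seconds + remaining

def fmt_hms(seconds):
    """Render seconds as 'Hh Mm Ss', matching the log's time format."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

# ~18.2 s/iteration after 22 iterations, assuming a 3000-iteration run,
# gives a remaining-time estimate in the same ballpark as the log's ~15 h.
remaining, total_time = eta_estimate(18.2, done=22, total=3000)
```

The per-horizon lines ("10 more iterations: 3m 2s, 100 more iterations: 30m 28s, ...") follow from the same multiplication with different iteration counts.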
[2025-11-13 08:13:00,504][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:04,529][__main__][INFO] - Number of regex retries in iteration 22: 0
[2025-11-13 08:13:04,530][__main__][INFO] - agents played in iteration 22 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:13:04,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,088][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:05,089][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:05,810][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:06,109][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:06,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:06,768][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:07,093][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:07,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:07,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:08,070][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:08,720][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:09,045][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:09,696][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:10,021][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:10,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:10,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:11,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:11,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:11,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:11,978][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:12,302][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:13,278][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:13,604][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:13,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:14,581][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:14,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:15,239][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:16,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:16,929][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:17,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:17,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:17,666][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:18,666][__main__][INFO] - Iteration 23 took 18s (22.17% Gen, 72.32% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 58m 43s. Estimated total time: 15h 8m 12s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 16s, 500 more iterations: 2h 31m 22s.
[2025-11-13 08:13:18,668][__main__][INFO] - Starting iteration 23.
[2025-11-13 08:13:18,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:18,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:22,715][__main__][INFO] - Number of regex retries in iteration 23: 0
[2025-11-13 08:13:22,715][__main__][INFO] - agents played in iteration 23 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:13:23,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,272][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:23,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:23,989][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:24,287][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:24,614][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:24,940][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:25,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:26,248][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:26,573][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:26,898][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:27,225][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:27,551][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:27,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:28,208][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:28,533][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:30,819][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:31,147][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:31,475][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:32,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:32,790][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:33,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:33,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:34,094][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:34,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:35,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:35,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:35,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:35,861][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:36,901][__main__][INFO] - Iteration 24 took 18s (22.17% Gen, 72.11% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 1m 42s. Estimated total time: 15h 11m 30s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 23s, 500 more iterations: 2h 31m 55s.
[2025-11-13 08:13:36,903][__main__][INFO] - Starting iteration 24.
[2025-11-13 08:13:36,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:36,907][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:40,928][__main__][INFO] - Number of regex retries in iteration 24: 0
[2025-11-13 08:13:40,928][__main__][INFO] - agents played in iteration 24 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:13:41,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,485][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:41,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:42,227][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:42,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:42,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:43,181][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:43,506][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:44,484][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:44,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:45,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:45,465][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:45,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:46,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:46,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:47,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:48,076][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:48,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:48,728][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:49,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:49,380][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:49,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:50,030][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:51,344][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:51,999][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:52,324][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:52,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:53,383][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:54,135][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:54,137][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:54,138][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:55,167][__main__][INFO] - Iteration 25 took 18s (22.02% Gen, 72.34% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 2m 58s. Estimated total time: 15h 13m 4s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 26s, 500 more iterations: 2h 32m 10s.
[2025-11-13 08:13:55,169][__main__][INFO] - Starting iteration 25.
[2025-11-13 08:13:55,172][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:55,173][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:59,315][__main__][INFO] - Number of regex retries in iteration 25: 0
[2025-11-13 08:13:59,316][__main__][INFO] - agents played in iteration 25 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:13:59,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,880][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:59,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:00,597][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:01,222][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:01,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:01,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:02,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:03,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:04,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:04,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:04,809][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:05,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:06,110][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:07,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:07,415][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:08,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:08,392][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:08,717][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:09,042][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:09,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:09,694][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:10,997][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:11,698][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:12,428][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:12,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:12,431][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:13,401][__main__][INFO] - Iteration 26 took 18s (22.73% Gen, 71.95% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 1m 4s. Estimated total time: 15h 11m 28s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 22s, 500 more iterations: 2h 31m 54s.
[2025-11-13 08:14:13,403][__main__][INFO] - Starting iteration 26.
[2025-11-13 08:14:13,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:13,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:17,515][__main__][INFO] - Number of regex retries in iteration 26: 0
[2025-11-13 08:14:17,515][__main__][INFO] - agents played in iteration 26 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:14:17,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:17,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:18,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:19,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:19,430][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:19,755][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:20,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:21,059][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:21,385][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:21,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:22,368][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:22,693][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:23,019][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:23,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:23,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:24,001][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:24,326][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:24,651][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:25,308][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:25,636][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:26,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:26,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:27,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:27,589][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:27,914][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:28,242][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:28,894][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:29,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:29,936][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:30,668][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:30,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:30,671][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:31,675][__main__][INFO] - Iteration 27 took 18s (22.49% Gen, 72.01% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 2m 45s. Estimated total time: 15h 13m 28s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 26s, 500 more iterations: 2h 32m 14s.
[2025-11-13 08:14:31,677][__main__][INFO] - Starting iteration 27.
[2025-11-13 08:14:31,681][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:31,681][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:14:35,806][__main__][INFO] - Number of regex retries in iteration 27: 0 [2025-11-13 08:14:35,806][__main__][INFO] - agents played in iteration 27 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:14:36,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,367][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:14:36,367][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:14:37,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:37,715][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:38,049][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:38,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:38,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:39,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:39,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:40,332][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:40,987][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:41,320][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:41,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:41,980][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:42,636][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:42,961][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:43,289][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:43,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:44,265][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:44,589][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:44,915][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:45,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:45,565][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:45,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:46,215][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:46,866][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:47,191][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:47,517][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:48,218][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:48,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:48,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:48,958][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:49,936][__main__][INFO] - Iteration 28 took 18s (22.60% Gen, 72.04% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 1m 49s. Estimated total time: 15h 12m 50s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 25s, 500 more iterations: 2h 32m 8s.
[2025-11-13 08:14:49,938][__main__][INFO] - Starting iteration 28.
[2025-11-13 08:14:49,942][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:49,943][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:54,143][__main__][INFO] - Number of regex retries in iteration 28: 0
[2025-11-13 08:14:54,144][__main__][INFO] - agents played in iteration 28 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:14:54,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:54,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:55,723][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:56,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:56,374][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:57,028][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:57,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:57,681][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:58,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:58,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:58,988][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:59,316][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:59,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:59,978][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:00,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:01,284][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:01,611][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:02,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:02,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:03,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:03,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:04,554][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:04,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:05,212][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:05,540][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:05,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:06,570][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:07,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:07,300][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:07,301][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:08,347][__main__][INFO] - Iteration 29 took 18s (22.82% Gen, 71.49% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 8m 58s. Estimated total time: 15h 20m 17s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 40s, 500 more iterations: 2h 33m 22s.
[2025-11-13 08:15:08,350][__main__][INFO] - Starting iteration 29.
[2025-11-13 08:15:08,354][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:08,354][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:12,445][__main__][INFO] - Number of regex retries in iteration 29: 0
[2025-11-13 08:15:12,446][__main__][INFO] - agents played in iteration 29 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:15:12,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:12,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:12,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,013][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:13,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:13,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:14,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:14,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:14,682][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:15,011][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:15,336][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:15,662][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:15,988][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:16,313][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:17,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:17,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:18,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:19,244][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:19,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:20,223][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:20,549][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:20,874][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:21,201][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:21,530][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:21,856][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:22,508][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:22,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:23,159][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:23,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:23,810][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:24,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:24,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:25,562][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:25,563][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:25,564][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:26,537][__main__][INFO] - Iteration 30 took 18s (22.50% Gen, 72.14% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 57m 37s. Estimated total time: 15h 9m 14s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 18s, 500 more iterations: 2h 31m 32s.
[2025-11-13 08:15:26,540][__main__][INFO] - Starting iteration 30.
[2025-11-13 08:15:26,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:26,544][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:30,701][__main__][INFO] - Number of regex retries in iteration 30: 0
[2025-11-13 08:15:30,702][__main__][INFO] - agents played in iteration 30 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:15:31,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,273][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:31,273][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:32,009][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:32,307][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:32,964][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:33,292][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:33,621][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:34,945][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:35,603][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:35,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:36,261][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:36,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:37,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:37,914][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:38,243][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:38,892][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:39,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:39,869][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:40,197][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:40,523][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:40,849][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:41,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:41,500][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:41,825][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:42,151][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:42,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:43,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:43,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:43,925][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:43,927][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:45,855][__main__][INFO] - Iteration 31 took 19s (21.53% Gen, 68.48% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 53m 41s. Estimated total time: 16h 5m 38s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 11s, 500 more iterations: 2h 40m 56s.
[2025-11-13 08:15:45,858][__main__][INFO] - Starting iteration 31.
[2025-11-13 08:15:45,861][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:15:45,862][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:50,582][__main__][INFO] - Number of regex retries in iteration 31: 0
[2025-11-13 08:15:50,583][__main__][INFO] - agents played in iteration 31 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:15:51,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,160][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:51,160][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:51,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:52,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:52,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:53,160][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:53,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:54,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:54,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:55,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:55,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:55,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:56,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:57,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:58,057][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:58,386][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:58,711][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:59,036][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:59,690][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:00,342][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:00,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:00,995][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:01,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:01,973][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:02,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:03,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:03,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:03,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:03,753][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:04,749][__main__][INFO] - Iteration 32 took 18s (24.99% Gen, 69.72% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 32m 13s. Estimated total time: 15h 44m 28s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 28s, 500 more iterations: 2h 37m 24s.
[2025-11-13 08:16:04,751][__main__][INFO] - Starting iteration 32.
[2025-11-13 08:16:04,755][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:04,755][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:09,317][__main__][INFO] - Number of regex retries in iteration 32: 0
[2025-11-13 08:16:09,317][__main__][INFO] - agents played in iteration 32 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:16:09,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,886][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:09,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:10,613][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:10,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:11,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:12,868][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:13,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:13,851][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:14,178][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:15,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:15,812][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:16,138][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:16,464][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:17,118][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:17,446][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:17,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:18,101][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:18,427][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:19,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:19,405][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:19,735][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:20,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:20,712][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:21,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:21,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:22,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:22,489][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:22,491][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:23,674][__main__][INFO] - Iteration 33 took 18s (24.11% Gen, 69.63% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 33m 25s. Estimated total time: 15h 45m 59s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 31s, 500 more iterations: 2h 37m 39s.
[2025-11-13 08:16:23,676][__main__][INFO] - Starting iteration 33.
[2025-11-13 08:16:23,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:23,679][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:28,165][__main__][INFO] - Number of regex retries in iteration 33: 0
[2025-11-13 08:16:28,166][__main__][INFO] - agents played in iteration 33 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:16:28,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,721][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:28,721][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:29,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:29,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:31,372][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:31,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:34,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:34,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:34,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:35,292][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:35,941][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:36,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:36,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:37,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:38,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:38,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:39,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:39,532][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:39,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:40,566][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:41,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:41,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:41,304][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:42,316][__main__][INFO] - Iteration 34 took 18s (24.07% Gen, 70.50% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 19m 0s. Estimated total time: 15h 31m 53s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 3s, 500 more iterations: 2h 35m 18s.
[2025-11-13 08:16:42,318][__main__][INFO] - Starting iteration 34.
[2025-11-13 08:16:42,321][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:42,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:46,871][__main__][INFO] - Number of regex retries in iteration 34: 0
[2025-11-13 08:16:46,872][__main__][INFO] - agents played in iteration 34 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:16:47,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:47,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:48,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:48,779][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:49,429][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:49,757][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:50,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:50,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:51,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:51,388][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:51,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:52,370][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:53,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:53,999][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:54,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:54,979][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:55,307][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:55,963][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:56,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:56,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:56,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:57,599][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:57,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:58,251][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:58,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:59,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:00,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:00,026][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:00,028][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:01,034][__main__][INFO] - Iteration 35 took 18s (24.31% Gen, 70.30% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 22m 29s. Estimated total time: 15h 35m 40s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 11s, 500 more iterations: 2h 35m 56s.
[2025-11-13 08:17:01,036][__main__][INFO] - Starting iteration 35.
[2025-11-13 08:17:01,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:01,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:05,557][__main__][INFO] - Number of regex retries in iteration 35: 0
[2025-11-13 08:17:05,558][__main__][INFO] - agents played in iteration 35 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:17:05,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:06,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:06,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:07,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:07,484][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:07,811][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:08,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:09,112][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:09,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:10,095][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:10,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:11,073][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:11,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:11,723][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:12,051][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:13,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:13,352][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:13,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:14,002][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:14,328][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:14,977][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:15,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:15,629][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:15,954][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:16,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:16,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:16,933][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:17,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:17,975][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:18,706][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:18,708][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:18,710][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:19,710][__main__][INFO] - Iteration 36 took 18s (24.20% Gen, 70.44% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 20m 3s. Estimated total time: 15h 33m 34s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 7s, 500 more iterations: 2h 35m 35s.
[2025-11-13 08:17:19,712][__main__][INFO] - Starting iteration 36.
[2025-11-13 08:17:19,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:19,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:24,171][__main__][INFO] - Number of regex retries in iteration 36: 0
[2025-11-13 08:17:24,172][__main__][INFO] - agents played in iteration 36 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:17:24,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,729][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:24,729][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:25,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:25,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:26,069][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:26,397][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:26,725][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:27,053][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:27,710][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:29,011][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:29,344][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:29,676][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:30,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:30,660][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:30,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:31,310][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:31,635][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:31,960][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:32,284][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:32,609][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:32,934][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:33,585][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:33,911][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:34,240][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:35,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:35,553][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:35,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:36,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:37,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:37,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:37,295][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:38,293][__main__][INFO] - Iteration 37 took 18s (23.98% Gen, 70.64% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 15m 5s. Estimated total time: 15h 28m 54s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 57s, 500 more iterations: 2h 34m 49s.
[2025-11-13 08:17:38,295][__main__][INFO] - Starting iteration 37.
[2025-11-13 08:17:38,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:38,299][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:40,578][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1
[2025-11-13 08:17:43,287][__main__][INFO] - Number of regex retries in iteration 37: 1
[2025-11-13 08:17:43,287][__main__][INFO] - agents played in iteration 37 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:17:43,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:43,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:44,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:45,475][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:45,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:46,123][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:46,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:47,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:48,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:48,740][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:49,392][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:50,047][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:50,373][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:50,699][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:51,025][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:51,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:52,330][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:52,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:53,628][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:53,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:54,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:55,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:56,380][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:56,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:56,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:57,354][__main__][INFO] - Iteration 38 took 19s (26.17% Gen, 68.72% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 38m 43s. Estimated total time: 15h 52m 51s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 45s, 500 more iterations: 2h 38m 48s.
[2025-11-13 08:17:57,356][__main__][INFO] - Starting iteration 38.
[2025-11-13 08:17:57,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:57,360][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:18:01,872][__main__][INFO] - Number of regex retries in iteration 38: 0 [2025-11-13 08:18:01,873][__main__][INFO] - agents played in iteration 38 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:18:02,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,416][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:18:02,417][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:18:03,128][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:03,752][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:04,077][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:04,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:05,059][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:06,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:06,684][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:07,008][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:07,652][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:08,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:08,945][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:09,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:09,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:10,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:10,560][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:10,884][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:11,207][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:11,530][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:11,853][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:12,176][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:12,499][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:13,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:14,177][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:14,899][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:14,900][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:14,902][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:15,875][__main__][INFO] - Iteration 39 took 18s (24.37% Gen, 70.37% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 11m 22s. Estimated total time: 15h 25m 49s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 51s, 500 more iterations: 2h 34m 18s.
[2025-11-13 08:18:15,877][__main__][INFO] - Starting iteration 39.
[2025-11-13 08:18:15,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:15,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:20,440][__main__][INFO] - Number of regex retries in iteration 39: 0
[2025-11-13 08:18:20,441][__main__][INFO] - agents played in iteration 39 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:18:20,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,991][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:20,991][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:21,721][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:22,341][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:22,992][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:23,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:23,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:23,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:24,289][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:24,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:24,942][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:25,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:25,916][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:26,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:27,219][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:27,542][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:27,868][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:28,190][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:28,514][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:28,837][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:29,160][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:29,486][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:29,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:30,135][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:30,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:30,786][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:31,112][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:31,440][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:31,765][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:32,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:32,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:33,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:33,518][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:33,519][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:34,486][__main__][INFO] - Iteration 40 took 18s (24.51% Gen, 70.29% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 15m 35s. Estimated total time: 15h 30m 20s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 0s, 500 more iterations: 2h 35m 3s.
[2025-11-13 08:18:34,488][__main__][INFO] - Starting iteration 40.
[2025-11-13 08:18:34,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:34,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:39,078][__main__][INFO] - Number of regex retries in iteration 40: 0
[2025-11-13 08:18:39,079][__main__][INFO] - agents played in iteration 40 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:18:39,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:39,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:40,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:40,638][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:40,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:41,286][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:41,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:41,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:42,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:42,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:43,555][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:43,878][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:44,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:44,852][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:45,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:45,506][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:45,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:46,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:46,479][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:47,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:47,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:47,782][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:48,429][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:48,753][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:49,400][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:49,725][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:50,048][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:50,372][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:50,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:51,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:52,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:52,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:52,140][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:54,036][__main__][INFO] - Iteration 41 took 19s (23.47% Gen, 66.83% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 2m 8s. Estimated total time: 16h 17m 13s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 34s, 500 more iterations: 2h 42m 52s.
[2025-11-13 08:18:54,038][__main__][INFO] - Starting iteration 41.
[2025-11-13 08:18:54,041][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:18:54,042][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:59,247][__main__][INFO] - Number of regex retries in iteration 41: 0
[2025-11-13 08:18:59,248][__main__][INFO] - agents played in iteration 41 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:18:59,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:59,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:00,507][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:00,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:01,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:01,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:02,105][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:02,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:02,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:03,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:04,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:04,368][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:04,691][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:05,014][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:05,337][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:05,660][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:05,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:06,305][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:06,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:07,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:07,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:07,922][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:08,890][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:09,213][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:09,860][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:10,183][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:10,507][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:10,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:11,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:12,260][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:12,262][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:12,263][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:13,241][__main__][INFO] - Iteration 42 took 19s (27.11% Gen, 67.78% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 44m 40s. Estimated total time: 16h 0m 4s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 0s, 500 more iterations: 2h 40m 0s.
[2025-11-13 08:19:13,243][__main__][INFO] - Starting iteration 42.
[2025-11-13 08:19:13,246][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:13,247][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:18,354][__main__][INFO] - Number of regex retries in iteration 42: 0
[2025-11-13 08:19:18,355][__main__][INFO] - agents played in iteration 42 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:19:18,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,898][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:18,898][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:19,620][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:19,916][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:20,892][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:21,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:21,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:22,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:22,830][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:23,155][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:23,479][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:23,802][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:24,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:24,451][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:24,774][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:25,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:25,421][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:25,745][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:26,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:26,392][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:26,719][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:27,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:27,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:27,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:28,013][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:28,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:28,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:28,987][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:29,633][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:29,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:30,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:31,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:31,412][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:31,413][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:32,388][__main__][INFO] - Iteration 43 took 19s (26.68% Gen, 68.22% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 41m 24s. Estimated total time: 15h 57m 7s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 54s, 500 more iterations: 2h 39m 31s.
[2025-11-13 08:19:32,399][__main__][INFO] - Starting iteration 43.
[2025-11-13 08:19:32,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:32,403][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:37,526][__main__][INFO] - Number of regex retries in iteration 43: 0
[2025-11-13 08:19:37,526][__main__][INFO] - agents played in iteration 43 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:19:37,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,073][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:38,073][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:38,784][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:39,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:39,403][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:40,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:40,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:41,344][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:41,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:41,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:42,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:42,642][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:42,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:43,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:43,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:44,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:45,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:45,558][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:45,881][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:46,203][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:46,850][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:47,175][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:47,500][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:47,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:48,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:48,798][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:49,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:49,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:50,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:50,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:50,551][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:51,581][__main__][INFO] - Iteration 44 took 19s (26.71% Gen, 67.91% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 42m 55s. Estimated total time: 15h 58m 57s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 57s, 500 more iterations: 2h 39m 49s.
[2025-11-13 08:19:51,583][__main__][INFO] - Starting iteration 44.
[2025-11-13 08:19:51,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:51,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:56,700][__main__][INFO] - Number of regex retries in iteration 44: 0
[2025-11-13 08:19:56,701][__main__][INFO] - agents played in iteration 44 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:19:57,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,246][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:57,246][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:57,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:58,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:58,605][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:58,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:59,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:59,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:59,899][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:00,876][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:01,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:01,523][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:01,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:02,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:02,496][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:02,820][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:03,466][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:03,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:04,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:04,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:05,406][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:05,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:06,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:06,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:06,700][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:07,023][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:07,346][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:07,994][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:08,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:09,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:09,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:09,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:09,754][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:10,737][__main__][INFO] - Iteration 45 took 19s (26.70% Gen, 68.17% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 41m 14s. Estimated total time: 15h 57m 35s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 55s, 500 more iterations: 2h 39m 35s.
[2025-11-13 08:20:10,739][__main__][INFO] - Starting iteration 45.
[2025-11-13 08:20:10,743][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:10,744][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:15,819][__main__][INFO] - Number of regex retries in iteration 45: 0
[2025-11-13 08:20:15,820][__main__][INFO] - agents played in iteration 45 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:20:16,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:16,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:17,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:17,374][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:18,022][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:18,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:18,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:18,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:19,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:19,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:19,967][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:20,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:21,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:21,583][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:21,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:22,230][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:22,879][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:23,203][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:23,527][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:23,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:24,174][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:24,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:24,821][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:25,145][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:25,470][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:25,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:26,117][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:26,764][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:27,088][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:27,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:28,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:28,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:28,851][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:28,853][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:29,828][__main__][INFO] - Iteration 46 took 19s (26.60% Gen, 68.29% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 37m 36s. Estimated total time: 15h 54m 17s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 48s, 500 more iterations: 2h 39m 2s.
[2025-11-13 08:20:29,830][__main__][INFO] - Starting iteration 46.
[2025-11-13 08:20:29,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:29,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:34,883][__main__][INFO] - Number of regex retries in iteration 46: 0
[2025-11-13 08:20:34,883][__main__][INFO] - agents played in iteration 46 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:20:35,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,426][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:35,427][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:36,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:36,454][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:36,778][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:37,425][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:37,750][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:38,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:38,721][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:39,045][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:39,370][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:39,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:40,018][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:40,342][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:40,666][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:40,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:41,314][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:41,963][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:42,290][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:42,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:42,937][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:43,261][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:43,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:44,233][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:44,556][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:44,881][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:45,204][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:45,526][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:45,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:46,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:47,223][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:47,974][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:47,975][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:47,977][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:48,975][__main__][INFO] - Iteration 47 took 19s (26.38% Gen, 68.39% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 40m 10s. Estimated total time: 15h 57m 10s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 54s, 500 more iterations: 2h 39m 31s.
[2025-11-13 08:20:48,978][__main__][INFO] - Starting iteration 47.
[2025-11-13 08:20:48,982][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:48,982][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:54,042][__main__][INFO] - Number of regex retries in iteration 47: 0
[2025-11-13 08:20:54,043][__main__][INFO] - agents played in iteration 47 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:20:54,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:54,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:55,609][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:55,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:56,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:56,907][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:57,231][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:57,878][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:58,204][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:58,528][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:58,851][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:59,501][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:00,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:00,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:01,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:01,443][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:01,767][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:02,092][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:02,416][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:03,065][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:03,713][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:04,036][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:04,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:05,008][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:05,333][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:05,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:06,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:07,107][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:07,108][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:07,110][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:08,110][__main__][INFO] - Iteration 48 took 19s (26.45% Gen, 68.31% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 39m 10s. Estimated total time: 15h 56m 29s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 52s, 500 more iterations: 2h 39m 24s.
[2025-11-13 08:21:08,113][__main__][INFO] - Starting iteration 48.
[2025-11-13 08:21:08,116][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:08,117][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:21:13,182][__main__][INFO] - Number of regex retries in iteration 48: 0
[2025-11-13 08:21:13,183][__main__][INFO] - agents played in iteration 48 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:21:13,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,725][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:21:13,726][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:21:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:21:14,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:21:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:21:15,391][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:21:15,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:21:16,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:21:16,368][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:21:16,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:21:17,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:21:17,340][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:21:17,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:21:17,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:21:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:21:18,636][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:21:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:19,933][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:20,582][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:20,907][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:21,560][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:21,884][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:22,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:22,533][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:22,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:23,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:23,505][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:23,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:24,150][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:24,475][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:24,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:25,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:26,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:26,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:26,261][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:27,274][__main__][INFO] - Iteration 49 took 19s (26.44% Gen, 68.27% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 40m 16s. Estimated total time: 15h 57m 54s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 55s, 500 more iterations: 2h 39m 39s.
[2025-11-13 08:21:27,276][__main__][INFO] - Starting iteration 49.
[2025-11-13 08:21:27,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:27,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:32,471][__main__][INFO] - Number of regex retries in iteration 49: 0 [2025-11-13 08:21:32,472][__main__][INFO] - agents played in iteration 49 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:21:32,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:33,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:33,018][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:33,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:21:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:34,031][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:34,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:35,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:35,325][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:35,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:35,974][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:36,947][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:37,271][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:37,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:38,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:38,893][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:39,214][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:39,538][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:39,861][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:40,184][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:21:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:40,831][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:21:41,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:21:41,477][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:21:41,801][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:21:42,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:21:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:21:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:21:43,096][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:21:43,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:21:43,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:21:44,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:21:44,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:21:45,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:21:45,526][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:21:45,528][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:21:46,506][__main__][INFO] - Iteration 50 took 19s (27.00% Gen, 67.90% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 43m 25s. Estimated total time: 16h 1m 22s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 2s, 500 more iterations: 2h 40m 13s. [2025-11-13 08:21:46,508][__main__][INFO] - Starting iteration 50. [2025-11-13 08:21:46,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-13 08:21:46,511][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:51,555][__main__][INFO] - Number of regex retries in iteration 50: 0 [2025-11-13 08:21:51,556][__main__][INFO] - agents played in iteration 50 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:21:51,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:52,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:52,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:52,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:52,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:52,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:21:52,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:53,446][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:54,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:55,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:56,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:56,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:57,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:57,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:57,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:57,993][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:58,318][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:58,645][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:58,971][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:59,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:21:59,622][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:59,945][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:00,268][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:00,592][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:00,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:01,238][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:01,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:01,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:02,536][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:02,859][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:03,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:22:03,892][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:04,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:04,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:04,630][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:06,551][__main__][INFO] - Iteration 51 took 20s (25.17% Gen, 65.24% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 23m 47s. Estimated total time: 16h 42m 4s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 24s, 500 more iterations: 2h 47m 0s. [2025-11-13 08:22:06,581][__main__][INFO] - Starting iteration 51. [2025-11-13 08:22:06,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:22:06,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:12,045][__main__][INFO] - Number of regex retries in iteration 51: 0 [2025-11-13 08:22:12,045][__main__][INFO] - agents played in iteration 51 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:22:12,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:12,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:12,585][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:22:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:13,918][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:14,242][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:14,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:15,541][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:15,865][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:17,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:17,484][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:17,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:18,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:19,102][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:19,425][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:19,748][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:22:20,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:20,393][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:20,717][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:21,041][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:21,365][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:21,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:22,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:22,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:23,305][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:23,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:22:24,336][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:25,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:25,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:25,079][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:26,073][__main__][INFO] - Iteration 52 took 19s (28.02% Gen, 66.88% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 55m 51s. Estimated total time: 16h 14m 28s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 28s, 500 more iterations: 2h 42m 24s. [2025-11-13 08:22:26,075][__main__][INFO] - Starting iteration 52. [2025-11-13 08:22:26,079][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:22:26,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:31,422][__main__][INFO] - Number of regex retries in iteration 52: 0 [2025-11-13 08:22:31,423][__main__][INFO] - agents played in iteration 52 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:22:31,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:31,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:31,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:22:32,685][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:33,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:34,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:35,252][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:35,578][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:36,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:36,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:36,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:37,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:37,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:37,844][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:38,169][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:38,492][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:22:39,463][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:39,786][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:40,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:22:40,434][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:22:40,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:22:41,081][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:22:41,405][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:22:41,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:22:42,054][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:22:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:22:42,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:22:43,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:22:43,737][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:22:44,465][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:44,467][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:44,468][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:45,575][__main__][INFO] - Iteration 53 took 19s (27.40% Gen, 66.91% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 55m 56s. Estimated total time: 16h 14m 52s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 29s, 500 more iterations: 2h 42m 28s. [2025-11-13 08:22:45,578][__main__][INFO] - Starting iteration 53. [2025-11-13 08:22:45,581][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:22:45,581][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:50,973][__main__][INFO] - Number of regex retries in iteration 53: 0 [2025-11-13 08:22:50,974][__main__][INFO] - agents played in iteration 53 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:22:51,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:51,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:51,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:22:52,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:22:52,518][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:22:52,845][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:22:53,173][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:22:53,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:22:53,828][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:22:54,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:22:54,482][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:22:54,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:22:55,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:22:55,464][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:22:55,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:22:56,120][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:22:56,449][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:22:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:22:57,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:22:57,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:22:57,756][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:22:58,082][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:22:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:22:58,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:22:59,064][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:22:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:22:59,711][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:23:00,034][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:23:00,359][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:23:00,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:23:01,008][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:23:01,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:23:01,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:23:01,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:23:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:23:02,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:23:03,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:23:04,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:23:04,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:23:04,103][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:23:05,103][__main__][INFO] - Iteration 54 took 19s (27.62% Gen, 67.25% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 56m 53s. Estimated total time: 16h 16m 9s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 32s, 500 more iterations: 2h 42m 41s. [2025-11-13 08:23:05,105][__main__][INFO] - Starting iteration 54. [2025-11-13 08:23:05,109][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-13 08:23:05,110][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:23:10,420][__main__][INFO] - Number of regex retries in iteration 54: 0 [2025-11-13 08:23:10,421][__main__][INFO] - agents played in iteration 54 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:23:10,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:10,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:10,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:10,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:10,964][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:23:10,965][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:23:11,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:23:11,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:23:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:23:12,625][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:23:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:23:13,273][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:23:13,597][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:23:13,922][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:23:14,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:23:14,570][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:23:14,895][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:23:15,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:23:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:23:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:23:16,191][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:23:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:23:16,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:23:17,164][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:23:17,488][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:23:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:23:18,134][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:23:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:23:18,782][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:23:19,105][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:23:19,429][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:23:19,753][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:23:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:23:20,402][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:23:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:23:21,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:23:21,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:23:21,702][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:23:22,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:23:22,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:23,467][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:23,469][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:23,470][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:24,458][__main__][INFO] - Iteration 55 took 19s (27.44% Gen, 67.44% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 47m 55s. Estimated total time: 16h 7m 30s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 15s, 500 more iterations: 2h 41m 15s.
[2025-11-13 08:23:24,460][__main__][INFO] - Starting iteration 55.
[2025-11-13 08:23:24,464][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:24,464][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:29,795][__main__][INFO] - Number of regex retries in iteration 55: 0
[2025-11-13 08:23:29,796][__main__][INFO] - agents played in iteration 55 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:23:30,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,347][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:30,347][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:31,061][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:31,357][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:31,681][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:32,005][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:32,328][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:32,651][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:32,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:33,622][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:33,946][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:34,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:34,917][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:35,240][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:35,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:35,887][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:36,211][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:36,859][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:37,511][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:37,839][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:38,815][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:39,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:39,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:39,794][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:40,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:40,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:40,768][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:41,092][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:41,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:42,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:42,847][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:42,848][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:42,850][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:43,825][__main__][INFO] - Iteration 56 took 19s (27.54% Gen, 67.42% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 48m 12s. Estimated total time: 16h 8m 7s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 16s, 500 more iterations: 2h 41m 21s.
[2025-11-13 08:23:43,828][__main__][INFO] - Starting iteration 56.
[2025-11-13 08:23:43,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:43,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:49,198][__main__][INFO] - Number of regex retries in iteration 56: 0
[2025-11-13 08:23:49,199][__main__][INFO] - agents played in iteration 56 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:23:49,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:49,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:50,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:50,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:51,097][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:51,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:52,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:52,399][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:53,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:54,669][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:55,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:55,965][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:56,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:56,939][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:57,264][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:57,590][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:57,913][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:58,240][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:58,567][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:59,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:59,549][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:00,202][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:00,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:01,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:02,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:02,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:02,283][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:03,272][__main__][INFO] - Iteration 57 took 19s (27.60% Gen, 67.30% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 51m 51s. Estimated total time: 16h 12m 5s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 24s, 500 more iterations: 2h 42m 0s.
[2025-11-13 08:24:03,274][__main__][INFO] - Starting iteration 57.
[2025-11-13 08:24:03,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:03,278][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:08,603][__main__][INFO] - Number of regex retries in iteration 57: 0
[2025-11-13 08:24:08,603][__main__][INFO] - agents played in iteration 57 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:24:09,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,145][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:09,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:10,827][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:11,153][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:11,799][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:12,123][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:12,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:12,773][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:13,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:13,423][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:14,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:14,395][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:14,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:15,043][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:15,367][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:16,663][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:17,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:17,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:17,959][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:18,284][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:18,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:18,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:19,256][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:19,581][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:19,905][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:20,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:20,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:21,692][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:21,694][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:21,695][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:22,697][__main__][INFO] - Iteration 58 took 19s (27.42% Gen, 67.41% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 50m 29s. Estimated total time: 16h 11m 2s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 22s, 500 more iterations: 2h 41m 50s.
[2025-11-13 08:24:22,700][__main__][INFO] - Starting iteration 58.
[2025-11-13 08:24:22,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:22,704][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:28,066][__main__][INFO] - Number of regex retries in iteration 58: 0
[2025-11-13 08:24:28,066][__main__][INFO] - agents played in iteration 58 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:24:28,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,612][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:28,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:29,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:29,630][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:30,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:30,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:30,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:31,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:31,583][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:31,907][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:32,233][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:32,557][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:32,885][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:33,208][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:33,531][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:34,179][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:34,502][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:34,826][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:35,151][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:35,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:36,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:36,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:37,109][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:37,432][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:37,755][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:38,404][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:39,051][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:39,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:40,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:41,126][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:41,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:41,129][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:42,171][__main__][INFO] - Iteration 59 took 19s (27.55% Gen, 67.10% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 52m 34s. Estimated total time: 16h 13m 26s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 26s, 500 more iterations: 2h 42m 14s.
[2025-11-13 08:24:42,174][__main__][INFO] - Starting iteration 59.
[2025-11-13 08:24:42,177][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:42,177][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:47,509][__main__][INFO] - Number of regex retries in iteration 59: 0
[2025-11-13 08:24:47,509][__main__][INFO] - agents played in iteration 59 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:24:47,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:48,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:48,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:48,055][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:48,055][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:48,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:49,068][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:49,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:50,045][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:50,693][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:51,018][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:51,343][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:51,666][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:52,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:52,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:53,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:54,258][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:54,582][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:54,906][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:55,230][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:55,553][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:56,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:57,516][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:57,842][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:58,167][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:58,492][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:59,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:59,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:00,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:00,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:00,585][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:01,557][__main__][INFO] - Iteration 60 took 19s (27.51% Gen, 67.47% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 47m 50s. Estimated total time: 16h 9m 2s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 18s, 500 more iterations: 2h 41m 30s.
[2025-11-13 08:25:01,559][__main__][INFO] - Starting iteration 60.
[2025-11-13 08:25:01,562][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:25:01,563][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:25:06,824][__main__][INFO] - Number of regex retries in iteration 60: 0 [2025-11-13 08:25:06,824][__main__][INFO] - agents played in iteration 60 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:25:07,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,370][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:25:07,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:25:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:08,397][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:09,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:09,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:09,692][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:10,016][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:10,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:10,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:11,633][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:11,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:12,280][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:12,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:13,580][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:13,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:14,231][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:14,879][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:15,205][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:15,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:15,852][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:16,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:17,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:18,127][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:18,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:19,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:19,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:19,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:19,916][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:21,947][__main__][INFO] - Iteration 61 took 20s (25.81% Gen, 64.22% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 37m 44s. Estimated total time: 16h 59m 17s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 58s, 500 more iterations: 2h 49m 52s.
[2025-11-13 08:25:21,949][__main__][INFO] - Starting iteration 61.
[2025-11-13 08:25:21,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:21,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:27,728][__main__][INFO] - Number of regex retries in iteration 61: 0
[2025-11-13 08:25:27,729][__main__][INFO] - agents played in iteration 61 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:25:28,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:28,275][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:28,985][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:29,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:29,605][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:29,929][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:30,255][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:30,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:30,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:31,227][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:31,551][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:31,875][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:32,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:32,522][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:32,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:33,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:33,817][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:35,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:35,434][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:35,758][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:36,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:36,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:36,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:37,052][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:37,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:37,703][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:38,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:38,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:38,997][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:39,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:40,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:40,744][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:40,745][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:40,747][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:41,720][__main__][INFO] - Iteration 62 took 19s (29.21% Gen, 65.86% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 6m 31s. Estimated total time: 16h 28m 23s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 56s, 500 more iterations: 2h 44m 43s.
[2025-11-13 08:25:41,722][__main__][INFO] - Starting iteration 62.
[2025-11-13 08:25:41,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:41,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:47,330][__main__][INFO] - Number of regex retries in iteration 62: 0
[2025-11-13 08:25:47,330][__main__][INFO] - agents played in iteration 62 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:25:47,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:47,878][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:48,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:48,898][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:49,553][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:49,876][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:50,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:50,524][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:50,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:51,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:51,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:51,822][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:52,146][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:52,470][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:52,793][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:53,116][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:53,763][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:54,086][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:55,058][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:55,382][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:56,353][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:56,678][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:57,326][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:57,981][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:58,635][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:58,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:59,682][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:00,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:00,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:00,423][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:01,400][__main__][INFO] - Iteration 63 took 19s (28.49% Gen, 66.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 1m 35s. Estimated total time: 16h 23m 47s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 47s, 500 more iterations: 2h 43m 57s.
[2025-11-13 08:26:01,402][__main__][INFO] - Starting iteration 63.
[2025-11-13 08:26:01,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:01,406][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:07,019][__main__][INFO] - Number of regex retries in iteration 63: 0
[2025-11-13 08:26:07,019][__main__][INFO] - agents played in iteration 63 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:26:07,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:07,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:08,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:08,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:09,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:09,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:10,203][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:10,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:11,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:11,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:12,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:12,473][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:12,797][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:13,122][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:14,093][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:14,416][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:14,739][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:15,063][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:16,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:16,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:16,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:17,327][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:17,973][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:18,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:18,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:19,333][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:20,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:20,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:20,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:21,086][__main__][INFO] - Iteration 64 took 19s (28.52% Gen, 66.31% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 1m 30s. Estimated total time: 16h 24m 2s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 48s, 500 more iterations: 2h 44m 0s.
[2025-11-13 08:26:21,088][__main__][INFO] - Starting iteration 64.
[2025-11-13 08:26:21,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:21,092][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:26,797][__main__][INFO] - Number of regex retries in iteration 64: 0
[2025-11-13 08:26:26,798][__main__][INFO] - agents played in iteration 64 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:26:27,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:27,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:28,080][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:28,376][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:29,027][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:29,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:29,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:30,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:30,658][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:30,985][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:31,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:31,640][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:32,616][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:32,942][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:33,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:33,922][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:34,248][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:34,900][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:35,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:35,873][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:36,197][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:36,849][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:37,172][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:37,496][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:37,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:38,143][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:38,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:39,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:39,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:39,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:39,942][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:40,941][__main__][INFO] - Iteration 65 took 19s (28.74% Gen, 66.22% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 9m 40s. Estimated total time: 16h 32m 32s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 5s, 500 more iterations: 2h 45m 25s.
[2025-11-13 08:26:40,949][__main__][INFO] - Starting iteration 65.
[2025-11-13 08:26:40,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:40,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:46,521][__main__][INFO] - Number of regex retries in iteration 65: 0
[2025-11-13 08:26:46,521][__main__][INFO] - agents played in iteration 65 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:26:46,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:47,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:48,084][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:48,409][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:48,733][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:49,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:49,384][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:49,708][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:50,032][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:50,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:50,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:51,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:51,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:51,975][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:52,630][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:53,279][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:53,602][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:53,928][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:54,580][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:55,550][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:55,875][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:56,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:56,523][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:56,846][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:57,170][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:57,816][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:58,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
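The progress lines above show 128 mini-batches per update with a log line only every 4th batch (0, 4, …, 124), i.e. gradient accumulation with throttled progress logging. A minimal sketch of that cadence, with illustrative names not taken from the mllm code:

```python
def accumulate(num_minibatches: int = 128, log_every: int = 4) -> list[str]:
    """Iterate over mini-batches, emitting a progress line every `log_every` batches.

    The gradient-accumulation work itself is elided; this only reproduces
    the logging cadence seen in the trainer output.
    """
    lines = []
    for i in range(num_minibatches):
        # ... forward/backward for mini-batch i would accumulate gradients here ...
        if i % log_every == 0:
            lines.append(f"Processing mini-batch {i} of {num_minibatches}")
    return lines
```

With the defaults this yields 32 progress lines for 128 mini-batches, matching the run of entries from batch 0 through batch 124 above.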
[2025-11-13 08:26:58,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:59,590][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:59,591][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:59,593][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:00,670][__main__][INFO] - Iteration 66 took 19s (28.24% Gen, 66.29% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 2m 45s. Estimated total time: 16h 25m 56s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 51s, 500 more iterations: 2h 44m 19s.
[2025-11-13 08:27:00,672][__main__][INFO] - Starting iteration 66.
[2025-11-13 08:27:00,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:00,676][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:06,260][__main__][INFO] - Number of regex retries in iteration 66: 0
[2025-11-13 08:27:06,260][__main__][INFO] - agents played in iteration 66 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:27:06,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:06,806][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:07,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:07,820][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:08,145][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:08,474][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:08,798][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:09,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:10,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:10,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:11,714][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:12,038][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:12,362][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:12,685][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:13,009][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:13,332][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:14,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:14,626][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:15,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:15,922][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:16,245][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:16,570][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:16,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:17,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:17,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:17,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:18,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:19,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:19,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:19,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:20,245][__main__][INFO] - Iteration 67 took 19s (28.53% Gen, 66.59% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 54m 57s. Estimated total time: 16h 18m 28s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 36s, 500 more iterations: 2h 43m 4s.
[2025-11-13 08:27:20,247][__main__][INFO] - Starting iteration 67.
[2025-11-13 08:27:20,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:20,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:25,939][__main__][INFO] - Number of regex retries in iteration 67: 0
[2025-11-13 08:27:25,940][__main__][INFO] - agents played in iteration 67 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:27:26,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:26,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:27,834][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:28,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:28,485][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:28,810][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:29,138][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:29,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:29,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:30,113][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:30,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:30,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:31,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:31,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:32,062][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:32,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:33,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:33,357][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:33,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:34,005][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:34,654][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:36,279][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:36,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:37,249][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:37,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:38,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:39,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:39,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:39,027][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:40,001][__main__][INFO] - Iteration 68 took 19s (28.80% Gen, 66.26% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 3m 46s. Estimated total time: 16h 27m 37s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 55s, 500 more iterations: 2h 44m 36s.
[2025-11-13 08:27:40,004][__main__][INFO] - Starting iteration 68.
[2025-11-13 08:27:40,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:40,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:45,588][__main__][INFO] - Number of regex retries in iteration 68: 0
[2025-11-13 08:27:45,589][__main__][INFO] - agents played in iteration 68 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:27:46,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,149][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:46,149][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:46,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:47,172][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:47,496][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:48,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:48,470][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:49,116][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:49,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:49,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:51,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:51,383][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:51,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:52,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:52,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:53,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:53,324][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:54,296][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:54,621][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:54,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:55,591][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:55,915][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:56,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:56,565][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:56,893][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:57,217][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:57,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:58,668][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:58,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:58,671][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:59,635][__main__][INFO] - Iteration 69 took 19s (28.43% Gen, 66.65% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 57m 16s. Estimated total time: 16h 21m 26s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 42s, 500 more iterations: 2h 43m 34s.
[2025-11-13 08:27:59,637][__main__][INFO] - Starting iteration 69.
[2025-11-13 08:27:59,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:59,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:05,259][__main__][INFO] - Number of regex retries in iteration 69: 0
[2025-11-13 08:28:05,259][__main__][INFO] - agents played in iteration 69 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:28:05,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,809][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:05,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:06,534][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:06,831][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:07,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:07,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:07,810][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:08,135][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:08,462][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:08,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:09,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:10,085][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:10,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:10,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:11,397][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:11,723][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:12,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:13,026][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:13,349][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:13,675][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:14,003][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:14,329][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:14,977][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:15,625][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:15,950][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:16,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:16,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:16,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:17,644][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:18,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:18,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:18,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:19,431][__main__][INFO] - Iteration 70 took 19s (28.39% Gen, 66.35% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 5m 3s. Estimated total time: 16h 29m 33s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 59s, 500 more iterations: 2h 44m 55s.
[2025-11-13 08:28:19,433][__main__][INFO] - Starting iteration 70.
[2025-11-13 08:28:19,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:28:19,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:25,035][__main__][INFO] - Number of regex retries in iteration 70: 0
[2025-11-13 08:28:25,035][__main__][INFO] - agents played in iteration 70 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:28:25,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,592][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:25,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:26,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:27,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:28,873][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:29,197][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:29,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:29,847][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:30,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:30,819][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:31,144][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:32,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:32,443][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:33,738][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:34,063][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:34,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:34,711][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:35,035][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:35,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:36,007][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:36,330][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:36,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:37,376][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:38,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:38,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:38,126][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:40,035][__main__][INFO] - Iteration 71 took 20s (27.18% Gen, 63.55% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 45m 10s. Estimated total time: 17h 10m 1s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 20s, 500 more iterations: 2h 51m 40s.
[2025-11-13 08:28:40,037][__main__][INFO] - Starting iteration 71.
[2025-11-13 08:28:40,041][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
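The per-iteration summary lines above ("Time estimates for 10 more iterations: 3m 17s, …") are straight extrapolations from the measured iteration duration. A minimal sketch of that arithmetic, with an assumed function name and signature (not taken from the mllm code):

```python
def estimate_remaining(avg_iter_seconds: float, iters_left: int) -> str:
    """Extrapolate a remaining-time string from the average iteration duration."""
    total = round(avg_iter_seconds * iters_left)  # total seconds, rounded
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    if h:
        return f"{h}h {m}m {s}s"
    if m:
        return f"{m}m {s}s"
    return f"{s}s"
```

For example, at roughly 19.7 s per iteration, 10 more iterations extrapolate to about 3m 17s, consistent with the summary lines in this run.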
[2025-11-13 08:28:40,042][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:46,133][__main__][INFO] - Number of regex retries in iteration 71: 0
[2025-11-13 08:28:46,134][__main__][INFO] - agents played in iteration 71 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:28:46,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:46,680][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:47,417][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:47,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:48,042][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:48,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:49,025][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:49,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:49,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:49,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:50,646][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:51,296][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:51,944][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:52,594][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:52,918][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:53,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:53,572][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:53,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:54,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:54,872][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:55,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:55,522][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:56,169][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:56,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:57,142][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:57,468][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:57,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:58,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:59,263][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:59,265][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:59,267][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:29:00,268][__main__][INFO] - Iteration 72 took 20s (30.12% Gen, 64.93% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 26m 12s. Estimated total time: 16h 51m 23s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 42s, 500 more iterations: 2h 48m 33s.
[2025-11-13 08:29:00,270][__main__][INFO] - Starting iteration 72.
[2025-11-13 08:29:00,273][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:29:00,274][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:29:06,190][__main__][INFO] - Number of regex retries in iteration 72: 0
[2025-11-13 08:29:06,191][__main__][INFO] - agents played in iteration 72 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:29:06,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,735][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:29:06,736][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:29:07,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:29:07,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:29:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:29:08,409][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:29:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:29:09,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:29:09,379][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:29:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:29:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:29:10,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:29:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:29:11,005][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:29:11,330][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:29:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:29:11,977][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:29:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:29:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:29:12,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:29:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:29:13,600][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:29:13,923][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:29:14,247][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:29:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:29:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:29:15,227][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:29:15,554][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:29:15,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:29:16,203][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:29:16,526][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:29:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:29:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:29:17,503][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:29:17,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:29:18,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:29:19,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:29:19,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:29:19,291][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:29:20,310][__main__][INFO] - Iteration 73 took 20s (29.53% Gen, 65.38% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 16m 21s. Estimated total time: 16h 41m 52s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 23s, 500 more iterations: 2h 46m 58s.
[2025-11-13 08:29:20,312][__main__][INFO] - Starting iteration 73.
[2025-11-13 08:29:20,315][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:29:20,316][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:29:26,294][__main__][INFO] - Number of regex retries in iteration 73: 0
[2025-11-13 08:29:26,295][__main__][INFO] - agents played in iteration 73 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:29:26,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:29:26,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:29:27,556][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:29:27,851][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:29:28,177][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:29:28,500][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:29:28,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:29:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:29:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:29:29,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:29:30,127][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:29:30,452][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:29:30,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:29:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:29:31,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:29:31,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:29:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:29:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:29:32,736][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:29:33,062][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:29:33,386][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:29:33,711][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:29:34,034][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:29:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:29:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:29:35,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:29:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:29:35,664][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:29:35,988][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:29:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:29:36,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:29:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:29:37,286][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:29:37,610][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:29:37,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:29:38,645][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:29:39,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:29:39,380][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:29:39,381][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:29:40,386][__main__][INFO] - Iteration 74 took 20s (29.79% Gen, 65.20% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 17m 44s. Estimated total time: 16h 43m 35s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 27s, 500 more iterations: 2h 47m 15s.
[2025-11-13 08:29:40,388][__main__][INFO] - Starting iteration 74.
[2025-11-13 08:29:40,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:29:40,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:29:46,231][__main__][INFO] - Number of regex retries in iteration 74: 0
[2025-11-13 08:29:46,231][__main__][INFO] - agents played in iteration 74 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:29:46,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:29:46,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:29:47,493][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:29:47,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:29:48,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:29:48,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:29:48,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:29:49,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:29:49,410][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:29:49,734][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:29:50,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:29:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:29:50,707][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:29:51,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:29:51,364][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:29:51,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:29:52,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:29:52,339][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:29:52,666][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:29:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:29:53,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:29:53,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:29:53,966][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:29:54,292][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:29:54,616][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:29:54,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:29:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:29:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:29:55,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:29:56,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:29:56,562][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:29:56,887][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:29:57,211][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:29:57,534][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:29:57,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:29:58,574][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:29:59,303][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:29:59,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:29:59,306][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:00,311][__main__][INFO] - Iteration 75 took 19s (29.31% Gen, 65.63% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 9m 51s. Estimated total time: 16h 36m 2s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 12s, 500 more iterations: 2h 46m 0s.
[2025-11-13 08:30:00,313][__main__][INFO] - Starting iteration 75.
[2025-11-13 08:30:00,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:00,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:06,192][__main__][INFO] - Number of regex retries in iteration 75: 0
[2025-11-13 08:30:06,193][__main__][INFO] - agents played in iteration 75 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:30:06,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,741][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:06,741][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:07,461][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:07,757][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:08,084][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:08,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:08,736][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:09,387][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:09,712][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:10,359][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:10,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:11,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:11,335][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:11,659][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:12,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:13,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:13,615][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:13,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:14,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:14,589][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:14,913][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:15,566][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:15,889][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:16,218][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:17,529][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:17,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:18,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:19,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:19,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:19,303][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:20,284][__main__][INFO] - Iteration 76 took 19s (29.43% Gen, 65.66% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 53s. Estimated total time: 16h 38m 24s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 16s, 500 more iterations: 2h 46m 24s.
[2025-11-13 08:30:20,286][__main__][INFO] - Starting iteration 76.
[2025-11-13 08:30:20,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:20,289][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:26,216][__main__][INFO] - Number of regex retries in iteration 76: 0
[2025-11-13 08:30:26,217][__main__][INFO] - agents played in iteration 76 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:30:26,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:26,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:27,490][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:27,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:28,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:28,769][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:29,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:29,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:30,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:30,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:31,050][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:31,376][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:31,702][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:32,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:32,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:33,001][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:33,648][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:34,294][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:34,617][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:36,558][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:36,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:37,206][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:37,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:38,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:39,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:39,309][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:39,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:40,373][__main__][INFO] - Iteration 77 took 20s (29.51% Gen, 65.20% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 17m 23s. Estimated total time: 16h 44m 15s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 28s, 500 more iterations: 2h 47m 22s.
[2025-11-13 08:30:40,375][__main__][INFO] - Starting iteration 77.
[2025-11-13 08:30:40,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:40,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:46,208][__main__][INFO] - Number of regex retries in iteration 77: 0
[2025-11-13 08:30:46,209][__main__][INFO] - agents played in iteration 77 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:30:46,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,765][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:46,765][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:47,484][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:47,779][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:48,107][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:48,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:48,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:49,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:49,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:49,730][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:50,054][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:51,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:51,692][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:52,015][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:52,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:52,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:52,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:53,317][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:54,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:54,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:54,945][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:55,268][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:55,591][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:55,916][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:56,244][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:56,890][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:57,541][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:57,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:58,583][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:59,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:59,323][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:59,324][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:00,353][__main__][INFO] - Iteration 78 took 19s (29.19% Gen, 65.65% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 36s. Estimated total time: 16h 38m 48s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 17s, 500 more iterations: 2h 46m 28s.
[2025-11-13 08:31:00,356][__main__][INFO] - Starting iteration 78.
[2025-11-13 08:31:00,360][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:00,360][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:06,241][__main__][INFO] - Number of regex retries in iteration 78: 0
[2025-11-13 08:31:06,242][__main__][INFO] - agents played in iteration 78 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:31:06,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,784][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:06,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:07,508][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:07,806][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:08,132][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:08,460][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:08,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:09,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:09,440][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:09,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:10,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:10,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:11,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:11,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:11,721][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:12,048][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:12,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:12,700][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:13,028][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:13,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:13,677][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:14,334][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:14,990][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:15,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:15,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:15,968][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:16,296][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:16,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:17,279][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:17,606][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:17,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:18,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:19,380][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:19,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:19,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:20,371][__main__][INFO] - Iteration 79 took 20s (29.39% Gen, 65.67% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 13m 4s. Estimated total time: 16h 40m 36s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 21s, 500 more iterations: 2h 46m 46s.
[2025-11-13 08:31:20,373][__main__][INFO] - Starting iteration 79.
[2025-11-13 08:31:20,376][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:20,377][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:26,129][__main__][INFO] - Number of regex retries in iteration 79: 0
[2025-11-13 08:31:26,130][__main__][INFO] - agents played in iteration 79 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:31:26,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:26,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:27,391][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:27,688][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:28,018][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:28,669][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:28,998][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:30,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:30,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:31,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:31,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:32,246][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:33,221][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:33,545][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:33,870][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:34,194][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:34,519][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:35,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:35,818][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:36,143][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:36,469][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:36,793][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:37,116][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:37,440][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:37,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:38,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:39,203][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:39,204][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:39,206][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:40,230][__main__][INFO] - Iteration 80 took 19s (28.97% Gen, 65.86% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 4m 53s. Estimated total time: 16h 32m 44s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 5s, 500 more iterations: 2h 45m 27s.
[2025-11-13 08:31:40,233][__main__][INFO] - Starting iteration 80.
[2025-11-13 08:31:40,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:40,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:46,085][__main__][INFO] - Number of regex retries in iteration 80: 0
[2025-11-13 08:31:46,086][__main__][INFO] - agents played in iteration 80 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:31:46,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:46,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:47,343][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:47,640][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:47,965][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:48,290][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:48,938][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:49,588][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:50,891][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:51,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:51,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:52,837][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:53,161][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:53,811][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:54,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:54,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:54,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:55,435][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:56,083][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:56,406][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:56,732][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:57,056][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:57,382][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:57,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:58,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:59,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:59,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:59,154][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:01,113][__main__][INFO] - Iteration 81 took 20s (28.01% Gen, 62.59% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 55m 40s. Estimated total time: 17h 23m 52s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 47s, 500 more iterations: 2h 53m 58s.
[2025-11-13 08:32:01,115][__main__][INFO] - Starting iteration 81.
[2025-11-13 08:32:01,119][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:01,119][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:07,302][__main__][INFO] - Number of regex retries in iteration 81: 0
[2025-11-13 08:32:07,302][__main__][INFO] - agents played in iteration 81 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:32:07,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:07,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:07,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:07,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:07,845][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:07,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:08,565][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:09,187][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:09,512][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:09,837][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:10,484][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:11,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:11,456][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:12,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:12,427][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:13,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:13,724][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:14,048][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:14,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:15,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:15,344][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:15,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:15,995][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:16,319][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:16,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:16,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:17,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:18,264][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:18,589][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:18,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:19,596][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:32:20,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:32:20,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:32:20,314][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:21,304][__main__][INFO] - Iteration 82 took 20s (30.63% Gen, 64.46% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 20m 45s. Estimated total time: 16h 49m 17s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 38s, 500 more iterations: 2h 48m 12s.
[2025-11-13 08:32:21,306][__main__][INFO] - Starting iteration 82.
[2025-11-13 08:32:21,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:21,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:32:27,475][__main__][INFO] - Number of regex retries in iteration 82: 0 [2025-11-13 08:32:27,476][__main__][INFO] - agents played in iteration 82 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:32:27,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:27,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:27,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:28,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:28,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:32:28,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:32:28,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:29,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:30,070][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:30,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:31,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:31,700][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:32,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:33,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:33,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:33,651][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:33,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:34,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:34,948][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:35,597][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:35,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:36,246][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:36,569][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:37,218][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:37,542][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:38,192][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:38,516][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:39,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:39,840][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:32:40,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:32:40,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:32:40,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:32:41,536][__main__][INFO] - Iteration 83 took 20s (30.48% Gen, 64.66% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 22m 32s. Estimated total time: 16h 51m 24s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 42s, 500 more iterations: 2h 48m 34s. [2025-11-13 08:32:41,538][__main__][INFO] - Starting iteration 83. [2025-11-13 08:32:41,542][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:32:41,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:32:47,741][__main__][INFO] - Number of regex retries in iteration 83: 0 [2025-11-13 08:32:47,741][__main__][INFO] - agents played in iteration 83 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:32:48,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:48,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:48,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:48,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:48,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:32:48,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:32:49,002][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:49,299][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:49,624][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:49,949][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:50,273][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:50,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:51,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:51,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:52,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:52,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:53,514][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:53,837][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:54,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:54,811][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:55,137][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:55,461][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:55,784][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:56,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:56,756][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:57,082][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:58,055][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:58,378][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:59,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:00,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:33:00,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:33:00,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:33:00,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:33:01,731][__main__][INFO] - Iteration 84 took 20s (30.70% Gen, 64.49% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 20m 16s. Estimated total time: 16h 49m 29s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 38s, 500 more iterations: 2h 48m 14s. [2025-11-13 08:33:01,733][__main__][INFO] - Starting iteration 84. [2025-11-13 08:33:01,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:33:01,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:33:07,960][__main__][INFO] - Number of regex retries in iteration 84: 0 [2025-11-13 08:33:07,961][__main__][INFO] - agents played in iteration 84 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:33:08,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:08,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:08,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:08,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:08,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:33:08,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:33:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:10,189][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:10,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:10,837][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:12,457][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:13,105][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:13,430][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:14,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:14,402][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:14,726][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:15,050][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:15,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:16,348][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:16,671][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:16,995][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:17,319][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:17,642][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:17,965][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:18,612][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:18,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:19,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:20,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:33:21,011][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:33:21,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:33:21,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:33:22,073][__main__][INFO] - Iteration 85 took 20s (30.60% Gen, 64.18% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 27m 21s. Estimated total time: 16h 56m 54s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 53s, 500 more iterations: 2h 49m 29s. [2025-11-13 08:33:22,075][__main__][INFO] - Starting iteration 85. [2025-11-13 08:33:22,079][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:33:22,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:33:28,298][__main__][INFO] - Number of regex retries in iteration 85: 0 [2025-11-13 08:33:28,299][__main__][INFO] - agents played in iteration 85 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:33:28,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:28,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:28,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:28,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:28,845][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:33:28,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:33:29,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:29,850][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:30,506][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:30,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:31,158][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:31,811][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:32,138][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:32,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:32,788][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:33,111][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:33,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:34,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:35,062][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:35,385][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:35,711][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:36,036][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:36,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:36,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:37,010][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:37,336][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:39,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:40,618][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:33:41,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:33:41,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:33:41,338][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:33:42,356][__main__][INFO] - Iteration 86 took 20s (30.67% Gen, 64.30% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 24m 2s. Estimated total time: 16h 53m 55s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 47s, 500 more iterations: 2h 48m 59s. [2025-11-13 08:33:42,359][__main__][INFO] - Starting iteration 86. [2025-11-13 08:33:42,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:33:42,362][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:33:48,618][__main__][INFO] - Number of regex retries in iteration 86: 0 [2025-11-13 08:33:48,619][__main__][INFO] - agents played in iteration 86 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:33:49,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:49,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:49,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:49,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:49,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:33:49,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:33:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:50,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:50,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:50,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:51,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:51,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:52,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:52,464][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:52,788][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:53,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:53,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:54,086][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:54,411][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:54,736][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:55,060][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:55,385][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:55,709][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:56,032][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:56,681][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:57,005][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:57,330][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:57,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:58,628][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:59,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:59,601][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:00,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:00,962][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:34:01,695][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:01,697][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:01,698][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:02,677][__main__][INFO] - Iteration 87 took 20s (30.79% Gen, 64.38% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 25m 35s. Estimated total time: 16h 55m 48s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 51s, 500 more iterations: 2h 49m 18s. [2025-11-13 08:34:02,679][__main__][INFO] - Starting iteration 87. [2025-11-13 08:34:02,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:34:02,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:34:08,852][__main__][INFO] - Number of regex retries in iteration 87: 0 [2025-11-13 08:34:08,853][__main__][INFO] - agents played in iteration 87 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:34:09,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:09,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:09,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:09,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:09,401][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:34:09,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:34:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:11,065][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:11,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:12,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:12,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:12,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:13,018][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:13,343][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:13,667][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:13,990][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:14,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:14,639][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:14,964][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:15,288][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:15,613][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:15,939][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:16,263][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:16,587][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:16,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:17,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:18,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:18,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:18,857][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:34:19,504][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:34:19,828][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:20,152][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:20,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:21,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:21,910][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:21,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:21,913][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:22,865][__main__][INFO] - Iteration 88 took 20s (30.57% Gen, 64.70% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 18m 38s. Estimated total time: 16h 49m 11s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 38s, 500 more iterations: 2h 48m 11s.
[2025-11-13 08:34:22,867][__main__][INFO] - Starting iteration 88.
[2025-11-13 08:34:22,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:22,871][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:29,046][__main__][INFO] - Number of regex retries in iteration 88: 0
[2025-11-13 08:34:29,046][__main__][INFO] - agents played in iteration 88 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:34:29,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:29,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:30,314][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:30,941][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:31,591][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:32,240][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:32,889][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:33,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:34,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:34,512][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:34,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:35,160][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:35,484][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:36,138][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:36,463][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:36,787][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:37,438][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:37,761][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:38,085][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:38,410][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:38,734][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:39,382][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:34:39,708][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:34:40,032][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:40,356][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:40,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:41,398][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:42,130][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:42,131][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:42,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:43,114][__main__][INFO] - Iteration 89 took 20s (30.50% Gen, 64.65% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 21m 20s. Estimated total time: 16h 52m 13s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 44s, 500 more iterations: 2h 48m 42s.
[2025-11-13 08:34:43,116][__main__][INFO] - Starting iteration 89.
[2025-11-13 08:34:43,120][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:43,120][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:49,337][__main__][INFO] - Number of regex retries in iteration 89: 0
[2025-11-13 08:34:49,337][__main__][INFO] - agents played in iteration 89 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:34:49,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,879][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:49,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:50,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:51,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:51,564][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:51,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:52,214][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:52,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:52,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:53,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:54,166][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:54,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:55,788][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:56,113][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:56,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:57,410][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:58,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:58,386][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:58,709][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:59,034][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:59,359][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:59,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:00,008][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:00,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:00,658][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:00,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:01,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:02,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:02,461][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:02,463][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:03,463][__main__][INFO] - Iteration 90 took 20s (30.56% Gen, 64.52% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 25m 58s. Estimated total time: 16h 57m 12s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 54s, 500 more iterations: 2h 49m 32s.
[2025-11-13 08:35:03,465][__main__][INFO] - Starting iteration 90.
[2025-11-13 08:35:03,468][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:35:03,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:09,677][__main__][INFO] - Number of regex retries in iteration 90: 0
[2025-11-13 08:35:09,678][__main__][INFO] - agents played in iteration 90 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:35:10,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:10,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:10,946][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:11,242][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:11,567][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:11,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:12,862][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:13,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:14,169][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:14,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:15,149][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:15,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:15,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:16,130][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:16,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:16,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:17,432][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:17,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:35:18,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:35:18,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:35:18,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:35:19,052][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:35:19,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:35:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:35:20,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:20,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:20,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:20,996][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:21,320][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:22,044][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:22,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:22,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:22,793][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:24,764][__main__][INFO] - Iteration 91 took 21s (29.15% Gen, 61.58% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 13m 13s. Estimated total time: 17h 44m 48s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 29s, 500 more iterations: 2h 57m 28s.
[2025-11-13 08:35:24,766][__main__][INFO] - Starting iteration 91.
[2025-11-13 08:35:24,769][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:35:24,769][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:31,427][__main__][INFO] - Number of regex retries in iteration 91: 0
[2025-11-13 08:35:31,428][__main__][INFO] - agents played in iteration 91 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:35:31,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:31,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:31,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:31,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:31,971][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:31,971][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:32,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:32,990][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:33,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:33,964][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:34,288][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:34,939][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:36,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:36,883][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:37,856][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:38,503][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:38,827][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:39,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:35:39,800][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:35:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:35:40,449][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:35:40,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:35:41,097][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:35:41,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:35:41,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:42,394][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:43,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:43,766][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:44,506][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:44,507][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:44,509][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:45,499][__main__][INFO] - Iteration 92 took 20s (32.12% Gen, 63.10% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 44m 35s. Estimated total time: 17h 16m 31s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 33s, 500 more iterations: 2h 52m 45s.
[2025-11-13 08:35:45,501][__main__][INFO] - Starting iteration 92.
[2025-11-13 08:35:45,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:35:45,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:51,888][__main__][INFO] - Number of regex retries in iteration 92: 0
[2025-11-13 08:35:51,889][__main__][INFO] - agents played in iteration 92 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:35:52,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:52,443][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:53,170][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:53,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:54,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:54,762][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:55,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:55,744][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:56,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:57,705][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:58,354][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:58,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:59,003][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:59,329][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:59,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:59,979][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:00,303][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:00,629][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:00,956][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:01,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:01,933][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:02,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:03,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:04,292][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:05,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:05,026][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:05,028][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:06,017][__main__][INFO] - Iteration 93 took 20s (31.12% Gen, 64.05% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 33m 22s. Estimated total time: 17h 5m 39s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 11s, 500 more iterations: 2h 50m 56s.
[2025-11-13 08:36:06,019][__main__][INFO] - Starting iteration 93.
[2025-11-13 08:36:06,022][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:06,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:36:12,423][__main__][INFO] - Number of regex retries in iteration 93: 0 [2025-11-13 08:36:12,423][__main__][INFO] - agents played in iteration 93 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:36:12,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:36:12,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:36:13,695][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:36:13,990][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:36:14,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:36:14,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:36:14,967][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:36:15,296][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:36:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:36:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:36:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:36:16,594][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:36:16,920][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:36:17,244][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:36:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:36:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:36:18,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:36:18,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:36:18,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:36:19,187][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:36:19,510][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:36:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:36:20,158][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:36:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:36:20,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:36:21,129][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:36:21,452][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:36:21,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:36:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:36:22,424][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:36:22,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:36:23,071][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:36:23,394][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:36:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:36:24,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:36:24,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:25,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:25,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:25,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:26,465][__main__][INFO] - Iteration 94 took 20s (31.31% Gen, 63.96% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 29m 36s. Estimated total time: 17h 2m 13s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 4s, 500 more iterations: 2h 50m 22s.
[2025-11-13 08:36:26,467][__main__][INFO] - Starting iteration 94.
[2025-11-13 08:36:26,471][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:26,472][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:32,807][__main__][INFO] - Number of regex retries in iteration 94: 0
[2025-11-13 08:36:32,808][__main__][INFO] - agents played in iteration 94 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:36:33,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:33,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:34,693][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:35,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:35,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:36,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:38,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:36:39,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:36:39,886][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:36:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:36:40,536][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:40,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:41,507][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:41,831][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:42,154][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:42,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:43,125][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:43,448][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:43,772][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:44,095][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:44,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:45,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:45,847][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:45,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:45,851][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:46,814][__main__][INFO] - Iteration 95 took 20s (31.14% Gen, 64.11% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 24m 13s. Estimated total time: 16h 57m 11s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 54s, 500 more iterations: 2h 49m 31s.
[2025-11-13 08:36:46,816][__main__][INFO] - Starting iteration 95.
[2025-11-13 08:36:46,819][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:46,820][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:53,212][__main__][INFO] - Number of regex retries in iteration 95: 0
[2025-11-13 08:36:53,212][__main__][INFO] - agents played in iteration 95 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:36:53,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,755][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:53,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:54,477][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:54,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:55,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:55,424][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:55,747][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:56,395][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:56,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:57,043][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:57,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:57,692][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:58,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:58,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:58,671][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:59,000][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:59,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:36:59,978][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:00,300][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:00,624][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:01,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:01,595][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:01,922][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:02,570][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:02,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:03,217][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:03,541][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:03,865][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:04,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:05,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:06,244][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:06,246][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:06,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:07,240][__main__][INFO] - Iteration 96 took 20s (31.30% Gen, 63.83% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 27m 47s. Estimated total time: 17h 1m 5s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 2s, 500 more iterations: 2h 50m 10s.
[2025-11-13 08:37:07,242][__main__][INFO] - Starting iteration 96.
[2025-11-13 08:37:07,246][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:07,246][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:13,765][__main__][INFO] - Number of regex retries in iteration 96: 0
[2025-11-13 08:37:13,765][__main__][INFO] - agents played in iteration 96 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:37:14,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:14,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:15,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:15,985][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:16,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:17,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:17,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:18,254][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:18,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:18,903][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:19,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:19,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:20,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:20,523][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:21,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:21,818][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:22,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:22,466][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:23,114][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:23,763][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:24,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:25,070][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:25,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:26,095][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:26,820][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:26,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:26,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:27,780][__main__][INFO] - Iteration 97 took 20s (31.75% Gen, 63.59% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 33m 5s. Estimated total time: 17h 6m 44s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 13s, 500 more iterations: 2h 51m 7s.
[2025-11-13 08:37:27,782][__main__][INFO] - Starting iteration 97.
[2025-11-13 08:37:27,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:27,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:34,227][__main__][INFO] - Number of regex retries in iteration 97: 0
[2025-11-13 08:37:34,228][__main__][INFO] - agents played in iteration 97 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:37:34,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:34,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:35,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:35,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:36,191][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:36,516][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:37,167][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:37,491][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:38,146][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:38,471][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:38,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:39,122][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:39,446][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:39,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:40,417][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:41,066][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:41,389][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:41,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:42,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:42,362][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:43,009][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:43,331][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:43,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:43,978][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:44,301][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:44,625][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:44,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:45,274][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:45,598][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:45,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:46,624][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:47,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:47,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:47,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:48,363][__main__][INFO] - Iteration 98 took 20s (31.30% Gen, 63.79% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 34m 56s. Estimated total time: 17h 8m 55s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 17s, 500 more iterations: 2h 51m 29s.
[2025-11-13 08:37:48,365][__main__][INFO] - Starting iteration 98.
[2025-11-13 08:37:48,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:48,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:54,859][__main__][INFO] - Number of regex retries in iteration 98: 0
[2025-11-13 08:37:54,860][__main__][INFO] - agents played in iteration 98 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:37:55,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:55,404][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:56,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:56,415][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:56,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:57,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:57,388][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:57,711][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:58,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:59,008][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:59,656][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:59,980][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:00,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:00,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:01,276][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:01,600][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:01,926][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:02,249][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:02,571][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:02,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:03,217][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:03,543][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:05,161][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:05,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:05,810][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:06,134][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:06,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:07,160][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:07,888][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:07,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:07,891][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:08,848][__main__][INFO] - Iteration 99 took 20s (31.69% Gen, 63.63% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 29m 39s. Estimated total time: 17h 3m 59s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 7s, 500 more iterations: 2h 50m 39s.
[2025-11-13 08:38:08,850][__main__][INFO] - Starting iteration 99.
[2025-11-13 08:38:08,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:38:08,854][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:15,391][__main__][INFO] - Number of regex retries in iteration 99: 0
[2025-11-13 08:38:15,392][__main__][INFO] - agents played in iteration 99 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:38:15,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:15,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:15,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:15,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:15,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:15,946][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:16,651][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:16,948][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:17,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:38:17,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:38:17,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:38:18,243][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:38:18,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:38:18,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:38:19,218][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:38:19,543][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:38:19,868][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:38:20,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:20,516][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:20,839][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:21,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:21,488][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:21,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:22,458][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:23,110][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:24,081][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:25,707][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:26,358][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:26,682][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:27,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:27,716][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:28,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:28,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:28,450][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:29,444][__main__][INFO] - Iteration 100 took 20s (31.75% Gen, 63.41% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 34m 53s. Estimated total time: 17h 9m 34s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 19s, 500 more iterations: 2h 51m 35s.
[2025-11-13 08:38:29,446][__main__][INFO] - Starting iteration 100.
[2025-11-13 08:38:29,449][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:38:29,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:35,999][__main__][INFO] - Number of regex retries in iteration 100: 0
[2025-11-13 08:38:36,000][__main__][INFO] - agents played in iteration 100 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:38:36,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:36,545][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:37,277][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:37,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:37,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:38:38,227][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:38:38,551][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:38:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:38:39,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:38:39,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:38:39,849][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:38:40,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:38:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:38:40,825][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:41,151][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:41,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:42,445][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:42,768][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:43,092][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:43,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:43,739][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:44,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:44,386][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:44,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:45,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:45,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:46,009][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:46,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:46,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:47,304][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:47,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:48,354][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:49,084][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:49,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:49,087][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:51,016][__main__][INFO] - Iteration 101 took 21s (30.37% Gen, 60.68% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 23m 21s. Estimated total time: 17h 58m 23s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 56s, 500 more iterations: 2h 59m 43s.
[2025-11-13 08:38:51,018][__main__][INFO] - Starting iteration 101.
[2025-11-13 08:38:51,023][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:38:51,023][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:57,879][__main__][INFO] - Number of regex retries in iteration 101: 0
[2025-11-13 08:38:57,879][__main__][INFO] - agents played in iteration 101 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:38:58,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:58,433][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:59,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:59,778][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:00,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:01,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:01,411][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:01,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:02,067][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:02,392][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:03,038][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:03,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:03,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:04,336][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:04,660][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:04,984][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:05,308][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:05,633][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:06,281][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:06,605][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:07,253][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:07,578][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:07,903][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:08,227][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:08,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:08,874][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:09,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:10,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:10,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:10,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:10,975][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:11,947][__main__][INFO] - Iteration 102 took 20s (32.77% Gen, 62.58% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 50m 51s. Estimated total time: 17h 26m 14s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 52s, 500 more iterations: 2h 54m 22s.
[2025-11-13 08:39:11,949][__main__][INFO] - Starting iteration 102.
[2025-11-13 08:39:11,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:11,952][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:39:18,717][__main__][INFO] - Number of regex retries in iteration 102: 0
[2025-11-13 08:39:18,718][__main__][INFO] - agents played in iteration 102 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:39:19,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:39:19,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:39:19,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:39:20,285][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:39:20,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:21,258][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:21,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:22,237][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:22,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:23,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:23,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:24,179][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:24,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:24,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:25,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:25,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:25,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:26,124][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:26,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:26,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:27,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:27,743][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:28,067][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:28,715][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:29,362][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:29,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:30,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:31,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:31,787][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:31,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:31,790][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:32,746][__main__][INFO] - Iteration 103 took 20s (32.54% Gen, 62.86% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 44m 1s. Estimated total time: 17h 19m 44s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 39s, 500 more iterations: 2h 53m 17s.
[2025-11-13 08:39:32,748][__main__][INFO] - Starting iteration 103.
[2025-11-13 08:39:32,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:32,752][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:39:39,452][__main__][INFO] - Number of regex retries in iteration 103: 0
[2025-11-13 08:39:39,452][__main__][INFO] - agents played in iteration 103 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:39:39,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:39,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:39,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:39:40,006][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:39:40,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:39:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:39:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:41,687][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:42,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:42,663][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:43,317][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:43,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:43,969][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:44,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:44,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:45,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:45,925][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:46,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:46,579][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:46,903][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:47,874][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:48,201][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:48,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:48,849][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:49,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:49,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:49,826][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:50,477][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:51,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:51,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:52,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:52,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:52,569][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:53,533][__main__][INFO] - Iteration 104 took 20s (32.24% Gen, 63.11% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 43m 3s. Estimated total time: 17h 19m 8s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 38s, 500 more iterations: 2h 53m 11s.
[2025-11-13 08:39:53,535][__main__][INFO] - Starting iteration 104.
[2025-11-13 08:39:53,538][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:53,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:00,379][__main__][INFO] - Number of regex retries in iteration 104: 0
[2025-11-13 08:40:00,379][__main__][INFO] - agents played in iteration 104 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:40:00,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:00,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:00,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:00,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:00,926][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:00,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:03,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:03,908][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:04,891][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:05,215][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:05,538][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:05,865][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:06,190][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:06,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:06,845][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:07,171][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:07,498][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:07,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:08,151][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:08,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:08,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:09,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:09,463][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:10,124][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:10,778][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:11,105][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:11,761][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:12,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:12,813][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:13,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:13,548][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:13,549][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:14,570][__main__][INFO] - Iteration 105 took 21s (32.52% Gen, 62.62% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 55m 12s. Estimated total time: 17h 31m 38s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 3s, 500 more iterations: 2h 55m 16s.
[2025-11-13 08:40:14,572][__main__][INFO] - Starting iteration 105.
[2025-11-13 08:40:14,575][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:14,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:21,237][__main__][INFO] - Number of regex retries in iteration 105: 0
[2025-11-13 08:40:21,237][__main__][INFO] - agents played in iteration 105 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:40:21,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:21,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:22,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:23,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:23,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:24,107][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:24,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:24,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:25,082][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:25,732][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:26,378][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:26,703][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:27,027][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:27,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:27,682][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:28,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:29,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:29,627][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:29,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:30,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:30,921][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:31,571][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:31,896][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:32,222][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:32,546][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:32,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:33,591][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:34,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:34,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:34,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:35,336][__main__][INFO] - Iteration 106 took 20s (32.08% Gen, 63.06% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 41m 19s. Estimated total time: 17h 18m 5s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 36s, 500 more iterations: 2h 53m 0s.
[2025-11-13 08:40:35,338][__main__][INFO] - Starting iteration 106.
[2025-11-13 08:40:35,342][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:35,342][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:42,014][__main__][INFO] - Number of regex retries in iteration 106: 0
[2025-11-13 08:40:42,015][__main__][INFO] - agents played in iteration 106 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:40:42,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,556][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:42,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:43,578][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:43,904][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:44,228][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:44,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:44,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:45,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:45,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:46,174][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:47,478][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:48,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:48,449][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:48,773][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:49,098][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:50,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:50,395][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:50,719][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:51,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:52,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:52,340][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:52,664][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:53,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:53,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:54,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:55,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:55,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:55,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:56,062][__main__][INFO] - Iteration 107 took 20s (32.20% Gen, 63.09% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 38m 55s. Estimated total time: 17h 16m 2s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 32s, 500 more iterations: 2h 52m 40s.
[2025-11-13 08:40:56,064][__main__][INFO] - Starting iteration 107.
[2025-11-13 08:40:56,067][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:56,067][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:02,799][__main__][INFO] - Number of regex retries in iteration 107: 0
[2025-11-13 08:41:02,800][__main__][INFO] - agents played in iteration 107 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:41:03,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:03,346][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:04,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:04,689][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:05,012][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:05,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:05,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:06,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:06,636][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:06,960][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:07,283][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:08,255][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:08,579][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:09,228][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:09,552][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:10,529][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:11,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:11,502][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:12,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:13,122][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:13,447][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:13,772][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:14,096][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:14,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:15,145][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:15,890][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:15,892][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:15,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:16,883][__main__][INFO] - Iteration 108 took 20s (32.34% Gen, 62.90% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 43m 21s. Estimated total time: 17h 20m 49s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 41s, 500 more iterations: 2h 53m 28s.
[2025-11-13 08:41:16,885][__main__][INFO] - Starting iteration 108.
[2025-11-13 08:41:16,888][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:16,888][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:23,627][__main__][INFO] - Number of regex retries in iteration 108: 0
[2025-11-13 08:41:23,628][__main__][INFO] - agents played in iteration 108 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:41:24,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:24,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:24,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:25,534][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:25,858][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:26,182][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:26,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:26,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:28,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:28,452][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:28,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:29,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:30,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:30,401][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:31,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:32,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:33,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:33,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:33,650][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:34,298][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:34,623][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:34,949][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:35,274][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:35,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:36,740][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:36,741][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:36,743][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:37,730][__main__][INFO] - Iteration 109 took 20s (32.33% Gen, 62.92% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 44m 20s. Estimated total time: 17h 22m 8s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 44s, 500 more iterations: 2h 53m 41s.
[2025-11-13 08:41:37,732][__main__][INFO] - Starting iteration 109.
[2025-11-13 08:41:37,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:37,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:44,520][__main__][INFO] - Number of regex retries in iteration 109: 0
[2025-11-13 08:41:44,521][__main__][INFO] - agents played in iteration 109 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:41:44,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:44,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,063][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:45,064][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:45,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:46,081][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:46,405][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:46,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:47,058][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:47,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:47,707][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:48,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:48,689][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:49,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:49,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:49,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:49,990][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:50,969][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:51,293][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:51,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:52,280][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:52,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:52,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:53,253][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:54,554][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:54,879][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:55,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:55,528][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:55,852][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:56,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:56,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:57,628][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:57,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:57,631][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:58,609][__main__][INFO] - Iteration 110 took 20s (32.50% Gen, 62.81% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 45m 31s. Estimated total time: 17h 23m 40s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 47s, 500 more iterations: 2h 53m 56s.
[2025-11-13 08:41:58,611][__main__][INFO] - Starting iteration 110.
[2025-11-13 08:41:58,613][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:58,614][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:05,325][__main__][INFO] - Number of regex retries in iteration 110: 0
[2025-11-13 08:42:05,325][__main__][INFO] - agents played in iteration 110 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:42:05,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,867][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:05,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:06,892][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:07,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:07,870][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:08,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:08,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:08,847][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:09,172][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:09,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:09,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:10,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:10,467][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:11,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:11,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:12,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:12,413][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:12,738][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:13,062][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:14,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:14,359][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:15,006][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:15,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:16,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:16,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:17,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:18,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:18,423][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:18,425][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:20,405][__main__][INFO] - Iteration 111 took 21s (30.80% Gen, 60.11% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 31m 7s. Estimated total time: 18h 9m 38s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 19s, 500 more iterations: 3h 1m 36s.
[2025-11-13 08:42:20,408][__main__][INFO] - Starting iteration 111.
[2025-11-13 08:42:20,411][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:20,411][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:27,570][__main__][INFO] - Number of regex retries in iteration 111: 0
[2025-11-13 08:42:27,570][__main__][INFO] - agents played in iteration 111 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:42:28,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:28,115][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:29,146][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:29,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:29,794][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:30,118][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:30,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:30,768][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:31,416][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:31,742][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:32,067][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:32,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:33,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:33,688][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:34,669][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:34,994][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:35,320][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:35,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:36,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:36,624][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:36,950][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:37,276][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:37,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:38,903][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:39,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:39,952][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:40,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:40,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:40,687][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:41,655][__main__][INFO] - Iteration 112 took 21s (33.70% Gen, 61.74% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 3m 23s. Estimated total time: 17h 42m 16s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 24s, 500 more iterations: 2h 57m 2s.
[2025-11-13 08:42:41,657][__main__][INFO] - Starting iteration 112.
[2025-11-13 08:42:41,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:41,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:48,559][__main__][INFO] - Number of regex retries in iteration 112: 0
[2025-11-13 08:42:48,560][__main__][INFO] - agents played in iteration 112 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:42:49,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:49,112][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:49,841][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:50,137][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:50,787][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:51,437][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:52,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:52,733][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:53,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:53,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:54,353][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:54,678][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:55,327][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:55,651][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:55,974][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:56,298][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:56,624][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:56,953][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:57,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:57,604][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:58,258][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:58,588][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:58,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:59,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:59,891][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:00,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:00,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:01,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:01,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:01,658][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:02,656][__main__][INFO] - Iteration 113 took 20s (32.86% Gen, 62.39% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 50m 33s. Estimated total time: 17h 29m 47s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 59s, 500 more iterations: 2h 54m 57s.
[2025-11-13 08:43:02,658][__main__][INFO] - Starting iteration 113.
[2025-11-13 08:43:02,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:02,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:09,673][__main__][INFO] - Number of regex retries in iteration 113: 0
[2025-11-13 08:43:09,674][__main__][INFO] - agents played in iteration 113 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:43:10,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,218][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:10,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:10,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:11,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:11,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:12,862][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:13,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:14,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:14,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:15,141][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:15,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:16,114][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:16,440][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:16,767][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:17,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:17,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:18,075][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:18,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:18,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:19,048][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:19,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:20,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:20,352][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:20,677][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:21,002][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:21,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:22,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:22,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:22,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:22,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:23,776][__main__][INFO] - Iteration 114 took 21s (33.21% Gen, 62.18% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 56m 13s. Estimated total time: 17h 35m 48s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 11s, 500 more iterations: 2h 55m 58s.
[2025-11-13 08:43:23,778][__main__][INFO] - Starting iteration 114.
[2025-11-13 08:43:23,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:23,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:30,610][__main__][INFO] - Number of regex retries in iteration 114: 0
[2025-11-13 08:43:30,611][__main__][INFO] - agents played in iteration 114 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:43:31,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:31,174][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:32,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:32,836][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:33,166][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:33,491][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:33,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:35,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:35,763][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:36,086][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:37,058][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:37,711][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:38,683][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:39,008][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:39,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:40,636][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:40,960][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:41,616][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:41,945][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:42,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:42,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:43,713][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:43,715][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:43,717][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:44,690][__main__][INFO] - Iteration 115 took 20s (32.66% Gen, 62.68% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 45m 33s. Estimated total time: 17h 25m 29s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 50s, 500 more iterations: 2h 54m 14s.
[2025-11-13 08:43:44,693][__main__][INFO] - Starting iteration 115.
[2025-11-13 08:43:44,697][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:44,697][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:43:51,753][__main__][INFO] - Number of regex retries in iteration 115: 0 [2025-11-13 08:43:51,753][__main__][INFO] - agents played in iteration 115 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:43:52,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:43:52,301][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:43:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:43:53,320][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:43:53,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:43:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:43:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:43:54,625][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:43:54,952][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:43:55,277][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:43:55,600][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:43:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:43:56,249][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:43:56,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:43:56,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:43:57,228][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:43:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:43:57,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:43:58,204][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:43:58,529][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:43:58,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:43:59,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:43:59,503][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:43:59,828][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:44:00,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:44:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:44:00,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:44:01,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:44:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:44:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:44:02,110][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:44:02,433][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:44:02,757][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:44:03,083][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:44:03,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:44:04,114][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:04,841][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:04,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:04,844][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:05,843][__main__][INFO] - Iteration 116 took 21s (33.37% Gen, 61.91% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 57m 4s. Estimated total time: 17h 37m 20s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 14s, 500 more iterations: 2h 56m 13s.
[2025-11-13 08:44:05,845][__main__][INFO] - Starting iteration 116.
[2025-11-13 08:44:05,847][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:05,848][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:12,859][__main__][INFO] - Number of regex retries in iteration 116: 0
[2025-11-13 08:44:12,860][__main__][INFO] - agents played in iteration 116 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:44:13,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:13,414][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:14,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:14,443][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:14,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:15,092][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:15,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:16,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:16,711][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:17,035][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:17,358][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:17,682][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:18,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:18,329][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:18,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:19,300][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:19,624][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:20,274][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:20,598][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:21,572][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:22,219][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:22,545][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:23,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:23,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:24,177][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:24,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:25,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:25,954][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:25,955][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:25,957][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:26,925][__main__][INFO] - Iteration 117 took 21s (33.27% Gen, 62.13% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 53m 17s. Estimated total time: 17h 33m 55s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 7s, 500 more iterations: 2h 55m 39s.
[2025-11-13 08:44:26,927][__main__][INFO] - Starting iteration 117.
[2025-11-13 08:44:26,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:26,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:33,809][__main__][INFO] - Number of regex retries in iteration 117: 0
[2025-11-13 08:44:33,810][__main__][INFO] - agents played in iteration 117 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:44:34,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:34,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:35,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:35,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:36,031][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:36,354][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:36,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:37,008][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:37,333][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:38,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:38,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:38,964][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:39,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:39,614][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:39,938][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:40,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:40,587][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:40,911][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:41,558][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:42,530][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:42,854][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:43,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:43,504][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:43,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:44,154][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:44,801][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:45,125][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:45,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:46,164][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:46,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:46,900][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:46,901][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:47,875][__main__][INFO] - Iteration 118 took 20s (32.83% Gen, 62.51% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 46m 15s. Estimated total time: 17h 27m 14s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 54s, 500 more iterations: 2h 54m 32s.
[2025-11-13 08:44:47,877][__main__][INFO] - Starting iteration 118.
[2025-11-13 08:44:47,881][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:47,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:54,827][__main__][INFO] - Number of regex retries in iteration 118: 0
[2025-11-13 08:44:54,828][__main__][INFO] - agents played in iteration 118 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:44:55,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,376][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:55,377][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:56,115][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:56,411][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:57,072][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:57,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:57,722][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:58,699][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:59,024][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:59,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:59,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:01,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:01,945][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:02,271][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:03,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:03,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:04,218][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:04,543][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:04,867][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:05,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:05,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:06,162][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:06,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:07,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:07,942][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:07,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:07,945][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:08,931][__main__][INFO] - Iteration 119 took 21s (33.00% Gen, 62.31% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 51m 13s. Estimated total time: 17h 32m 33s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 5s, 500 more iterations: 2h 55m 25s.
[2025-11-13 08:45:08,933][__main__][INFO] - Starting iteration 119.
[2025-11-13 08:45:08,936][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:08,937][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:15,896][__main__][INFO] - Number of regex retries in iteration 119: 0
[2025-11-13 08:45:15,897][__main__][INFO] - agents played in iteration 119 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:45:16,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,455][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:16,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:17,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:18,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:18,781][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:19,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:20,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:20,732][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:21,057][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:21,382][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:21,706][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:22,031][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:22,680][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:23,004][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:23,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:24,305][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:24,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:24,955][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:25,279][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:25,603][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:25,928][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:26,252][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:26,577][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:26,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:27,226][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:27,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:28,272][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:29,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:29,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:29,020][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:30,005][__main__][INFO] - Iteration 120 took 21s (33.04% Gen, 62.28% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 51m 47s. Estimated total time: 17h 33m 27s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 6s, 500 more iterations: 2h 55m 34s.
[2025-11-13 08:45:30,007][__main__][INFO] - Starting iteration 120.
[2025-11-13 08:45:30,010][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:30,011][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:36,941][__main__][INFO] - Number of regex retries in iteration 120: 0
[2025-11-13 08:45:36,942][__main__][INFO] - agents played in iteration 120 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:45:37,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,488][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:37,489][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:38,210][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:38,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:39,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:39,481][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:39,806][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:40,456][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:40,781][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:41,106][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:41,433][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:41,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:42,082][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:42,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:42,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:43,058][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:43,381][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:44,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:44,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:45,654][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:46,629][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:46,954][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:47,604][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:47,929][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:48,255][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:48,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:49,301][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:50,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:50,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:50,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:51,997][__main__][INFO] - Iteration 121 took 21s (31.52% Gen, 59.61% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 37m 21s. Estimated total time: 18h 19m 24s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 38s, 500 more iterations: 3h 3m 14s.
[2025-11-13 08:45:51,999][__main__][INFO] - Starting iteration 121.
[2025-11-13 08:45:52,002][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:45:52,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:59,488][__main__][INFO] - Number of regex retries in iteration 121: 0
[2025-11-13 08:45:59,489][__main__][INFO] - agents played in iteration 121 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:45:59,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:59,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,050][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:00,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:00,772][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:01,069][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:01,395][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:01,718][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:02,692][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:03,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:03,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:04,000][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:04,325][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:04,974][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:05,300][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:05,623][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:06,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:08,223][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:08,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:08,875][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:09,525][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:10,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:10,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:10,827][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:11,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:11,871][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:12,620][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:12,621][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:12,623][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:13,616][__main__][INFO] - Iteration 122 took 21s (34.63% Gen, 60.77% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 18s. Estimated total time: 18h 0m 43s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 1s, 500 more iterations: 3h 0m 7s.
[2025-11-13 08:46:13,618][__main__][INFO] - Starting iteration 122.
[2025-11-13 08:46:13,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:13,621][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:20,918][__main__][INFO] - Number of regex retries in iteration 122: 0
[2025-11-13 08:46:20,919][__main__][INFO] - agents played in iteration 122 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:46:21,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,468][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:21,468][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:22,210][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:22,832][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:23,480][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:23,807][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:24,133][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:24,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:25,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:25,443][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:25,768][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:26,744][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:27,720][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:29,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:29,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:29,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:30,326][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:30,651][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:30,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:31,301][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:32,278][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:32,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:33,337][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:34,074][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:34,076][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:34,077][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:35,058][__main__][INFO] - Iteration 123 took 21s (34.04% Gen, 61.38% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 9m 8s. Estimated total time: 17h 51m 54s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 43s, 500 more iterations: 2h 58m 39s.
[2025-11-13 08:46:35,061][__main__][INFO] - Starting iteration 123.
[2025-11-13 08:46:35,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:35,065][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:42,259][__main__][INFO] - Number of regex retries in iteration 123: 0
[2025-11-13 08:46:42,259][__main__][INFO] - agents played in iteration 123 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:46:42,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,811][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:42,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:43,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:43,835][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:44,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:44,808][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:45,137][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:45,462][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:45,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:46,113][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:46,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:46,764][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:47,091][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:47,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:48,064][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:49,036][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:49,361][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:49,690][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:50,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:50,666][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:50,990][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:51,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:51,964][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:52,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:53,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:53,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:54,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:55,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:55,379][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:55,381][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:56,371][__main__][INFO] - Iteration 124 took 21s (33.76% Gen, 61.58% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 2m 14s. Estimated total time: 17h 45m 21s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 30s, 500 more iterations: 2h 57m 33s.
[2025-11-13 08:46:56,373][__main__][INFO] - Starting iteration 124.
[2025-11-13 08:46:56,376][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:56,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:03,702][__main__][INFO] - Number of regex retries in iteration 124: 0
[2025-11-13 08:47:03,702][__main__][INFO] - agents played in iteration 124 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:47:04,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,251][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:04,251][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:04,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:05,600][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:05,924][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:06,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:07,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:07,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:08,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:08,530][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:08,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:09,178][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:09,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:10,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:10,478][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:11,128][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:11,452][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:12,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:12,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:12,746][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:13,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:13,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:14,692][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:15,017][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:15,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:16,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:16,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:16,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:16,812][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:17,809][__main__][INFO] - Iteration 125 took 21s (34.18% Gen, 61.16% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 8m 15s. Estimated total time: 17h 51m 43s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 43s, 500 more iterations: 2h 58m 37s.
[2025-11-13 08:47:17,811][__main__][INFO] - Starting iteration 125.
[2025-11-13 08:47:17,815][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:17,815][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:25,114][__main__][INFO] - Number of regex retries in iteration 125: 0
[2025-11-13 08:47:25,115][__main__][INFO] - agents played in iteration 125 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:47:25,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,674][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:25,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:26,391][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:26,687][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:27,012][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:27,986][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:28,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:28,639][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:28,964][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:29,615][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:30,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:30,913][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:31,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:31,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:32,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:33,526][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:35,159][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:36,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:37,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:38,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:38,221][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:38,223][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:39,173][__main__][INFO] - Iteration 126 took 21s (34.18% Gen, 61.37% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 4m 6s. Estimated total time: 17h 47m 56s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 35s, 500 more iterations: 2h 57m 59s.
[2025-11-13 08:47:39,175][__main__][INFO] - Starting iteration 126.
[2025-11-13 08:47:39,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:39,179][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:46,315][__main__][INFO] - Number of regex retries in iteration 126: 0
[2025-11-13 08:47:46,316][__main__][INFO] - agents played in iteration 126 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:47:46,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:46,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:47,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:47,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:48,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:48,534][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:49,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:49,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:50,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:50,478][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:51,126][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:51,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:51,773][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:52,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:53,068][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:53,393][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:54,364][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:54,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:55,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:55,340][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:55,669][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:56,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:56,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:56,964][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:57,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:57,612][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:57,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:58,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:59,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:59,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:59,388][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:00,365][__main__][INFO] - Iteration 127 took 21s (33.68% Gen, 61.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 55m 12s. Estimated total time: 17h 39m 23s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 18s, 500 more iterations: 2h 56m 33s.
[2025-11-13 08:48:00,368][__main__][INFO] - Starting iteration 127.
[2025-11-13 08:48:00,371][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:00,371][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:07,829][__main__][INFO] - Number of regex retries in iteration 127: 0
[2025-11-13 08:48:07,830][__main__][INFO] - agents played in iteration 127 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:48:08,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,375][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:08,375][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:09,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:09,387][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:09,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:10,371][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:10,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:11,990][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:12,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:12,962][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:13,285][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:13,608][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:13,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:14,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:14,583][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:14,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:15,231][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:15,555][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:15,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:16,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:16,527][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:16,850][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:17,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:17,497][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:17,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:18,147][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:18,793][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:19,117][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:19,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:48:20,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:20,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:20,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:20,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:21,832][__main__][INFO] - Iteration 128 took 21s (34.75% Gen, 60.88% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 8m 33s. Estimated total time: 17h 53m 6s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 46s, 500 more iterations: 2h 58m 51s.
[2025-11-13 08:48:21,835][__main__][INFO] - Starting iteration 128.
[2025-11-13 08:48:21,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:21,838][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:29,208][__main__][INFO] - Number of regex retries in iteration 128: 0
[2025-11-13 08:48:29,208][__main__][INFO] - agents played in iteration 128 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:48:29,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,757][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:29,757][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:30,483][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:30,778][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:31,427][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:32,076][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:32,400][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:32,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:33,049][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:33,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:33,697][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:34,022][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:34,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:34,671][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:35,321][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:35,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:36,295][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:36,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:36,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:37,913][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:38,237][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:38,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:39,208][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:39,533][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:40,181][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:40,505][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:40,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:48:41,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:42,259][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:42,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:42,262][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:43,221][__main__][INFO] - Iteration 129 took 21s (34.47% Gen, 61.04% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 4m 19s. Estimated total time: 17h 49m 13s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 38s, 500 more iterations: 2h 58m 12s.
[2025-11-13 08:48:43,223][__main__][INFO] - Starting iteration 129.
[2025-11-13 08:48:43,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:43,226][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:50,371][__main__][INFO] - Number of regex retries in iteration 129: 0
[2025-11-13 08:48:50,372][__main__][INFO] - agents played in iteration 129 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:48:50,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,926][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:50,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:51,642][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:52,263][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:52,590][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:52,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:53,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:53,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:53,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:54,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:54,545][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:55,194][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:55,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:55,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:56,167][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:56,491][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:56,815][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:57,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:57,791][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:58,446][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:58,772][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:59,745][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:00,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:00,395][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:00,719][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:01,042][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:01,691][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:02,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:02,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:03,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:03,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:03,460][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:04,429][__main__][INFO] - Iteration 130 took 21s (33.70% Gen, 61.73% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 54m 56s. Estimated total time: 17h 40m 12s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 20s, 500 more iterations: 2h 56m 42s.
[2025-11-13 08:49:04,431][__main__][INFO] - Starting iteration 130.
[2025-11-13 08:49:04,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:49:04,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:11,694][__main__][INFO] - Number of regex retries in iteration 130: 0
[2025-11-13 08:49:11,695][__main__][INFO] - agents played in iteration 130 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:49:12,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:12,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:14,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:14,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:14,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:15,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:16,835][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:17,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:17,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:19,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:19,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:19,748][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:20,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:20,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:21,369][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:21,692][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:22,015][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:22,339][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:22,663][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:23,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:24,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:24,744][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:24,745][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:24,747][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:26,660][__main__][INFO] - Iteration 131 took 22s (32.66% Gen, 58.72% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 45m 44s. Estimated total time: 18h 31m 21s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 2s, 500 more iterations: 3h 5m 13s.
[2025-11-13 08:49:26,662][__main__][INFO] - Starting iteration 131.
[2025-11-13 08:49:26,666][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:49:26,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:34,592][__main__][INFO] - Number of regex retries in iteration 131: 0
[2025-11-13 08:49:34,593][__main__][INFO] - agents played in iteration 131 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:49:35,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:35,140][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:35,931][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:36,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:37,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:38,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:38,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:38,850][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:39,174][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:39,497][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:40,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:40,469][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:40,792][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:41,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:42,089][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:42,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:42,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:43,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:44,033][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:44,358][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:44,682][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:45,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:45,654][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:46,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
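The cadence above — mini-batch indices logged in steps of 4 across 128 mini-batches, followed by a single accumulated loss over 3840 tokens — is consistent with gradient accumulation that logs every fourth mini-batch and averages 30 tokens per mini-batch. A minimal sketch of that bookkeeping (the names `accumulation_log_lines`, `log_every`, and `tokens_per_batch` are illustrative, not from the mllm codebase):

```python
# Hypothetical reconstruction of the logging cadence seen in the log above;
# function and variable names are illustrative, not from the mllm codebase.
def accumulation_log_lines(num_minibatches=128, log_every=4, total_tokens=3840):
    """Return the mini-batch indices that get logged and the per-batch token count."""
    logged = [i for i in range(num_minibatches) if i % log_every == 0]
    tokens_per_batch = total_tokens // num_minibatches  # 3840 / 128 = 30 tokens each
    return logged, tokens_per_batch

logged, per_batch = accumulation_log_lines()
# logged -> [0, 4, 8, ..., 124] (32 entries); per_batch -> 30
```

This matches the 32 "Processing mini-batch" entries per iteration observed in the log.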
[2025-11-13 08:49:47,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:47,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:47,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:47,779][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:48,739][__main__][INFO] - Iteration 132 took 22s (35.91% Gen, 59.73% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 37m 44s. Estimated total time: 18h 23m 43s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 47s, 500 more iterations: 3h 3m 57s.
[2025-11-13 08:49:48,742][__main__][INFO] - Starting iteration 132.
[2025-11-13 08:49:48,745][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
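The extrapolated figures in the per-iteration summary (e.g. "500 more iterations: 3h 3m 57s" alongside a 22s iteration) are consistent with multiplying an average per-iteration wall time by the horizon and formatting the result. A hedged sketch of that arithmetic — the ~22.074 s average is inferred from the logged numbers, and the exact averaging scheme the trainer uses is an assumption:

```python
# Hedged sketch of the ETA arithmetic behind summary lines like
# "500 more iterations: 3h 3m 57s"; the averaging scheme is an assumption.
def format_eta(avg_iter_seconds: float, horizon: int) -> str:
    """Extrapolate total seconds for `horizon` iterations and format as h/m/s."""
    total = round(avg_iter_seconds * horizon)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

# With the ~22.074 s/iteration implied by iteration 132's summary:
print(format_eta(22.074, 500))  # -> "3h 3m 57s"
print(format_eta(22.074, 100))  # -> "36m 47s"
```

The same average times the roughly 2875 remaining iterations also reproduces the logged "Estimated remaining time: 17h 37m 44s" to within rounding.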
[2025-11-13 08:49:48,746][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:56,397][__main__][INFO] - Number of regex retries in iteration 132: 0
[2025-11-13 08:49:56,398][__main__][INFO] - agents played in iteration 132 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:49:56,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:56,942][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:57,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:58,299][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:58,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:59,278][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:59,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:00,254][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:00,577][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:00,901][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:01,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:01,876][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:02,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:02,524][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:02,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:03,173][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:03,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:03,822][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:04,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:04,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:05,119][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:06,419][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:07,390][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:07,714][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:08,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:08,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:09,479][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:09,481][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:09,482][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:10,435][__main__][INFO] - Iteration 133 took 21s (35.27% Gen, 60.32% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 12s. Estimated total time: 18h 4m 33s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 9s, 500 more iterations: 3h 0m 45s.
[2025-11-13 08:50:10,437][__main__][INFO] - Starting iteration 133.
[2025-11-13 08:50:10,440][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:10,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:18,013][__main__][INFO] - Number of regex retries in iteration 133: 0
[2025-11-13 08:50:18,014][__main__][INFO] - agents played in iteration 133 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:50:18,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:18,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:19,286][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:19,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:19,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:20,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:20,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:21,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:21,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:22,203][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:22,853][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:23,180][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:23,504][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:23,828][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:24,152][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:24,477][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:24,801][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:25,449][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:25,773][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:26,100][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:26,425][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:27,722][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:28,697][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:29,021][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:29,345][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:29,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:30,384][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:31,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:31,114][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:31,115][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:32,065][__main__][INFO] - Iteration 134 took 21s (35.01% Gen, 60.58% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 14m 34s. Estimated total time: 18h 1m 17s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 2s, 500 more iterations: 3h 0m 12s.
[2025-11-13 08:50:32,067][__main__][INFO] - Starting iteration 134.
[2025-11-13 08:50:32,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:32,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:39,273][__main__][INFO] - Number of regex retries in iteration 134: 0
[2025-11-13 08:50:39,274][__main__][INFO] - agents played in iteration 134 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:50:39,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:39,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:39,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:39,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:39,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:39,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:40,840][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:41,166][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:41,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:41,816][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:42,140][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:42,465][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:43,114][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:43,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:43,764][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:44,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:44,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:45,064][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:45,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:46,036][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:46,362][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:46,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:47,017][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:47,345][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:47,675][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:48,000][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:48,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:48,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:49,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:49,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:49,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:50,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:50,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:50,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:51,629][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:52,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:52,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:52,357][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:53,383][__main__][INFO] - Iteration 135 took 21s (33.79% Gen, 61.39% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 58m 34s. Estimated total time: 17h 45m 38s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 31s, 500 more iterations: 2h 57m 36s.
[2025-11-13 08:50:53,385][__main__][INFO] - Starting iteration 135.
[2025-11-13 08:50:53,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:53,389][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:01,009][__main__][INFO] - Number of regex retries in iteration 135: 0
[2025-11-13 08:51:01,010][__main__][INFO] - agents played in iteration 135 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:51:01,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,558][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:01,558][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:02,271][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:02,895][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:03,219][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:03,541][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:04,192][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:04,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:04,839][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:05,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:05,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:05,813][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:06,463][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:06,789][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:07,115][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:08,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:08,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:08,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:09,079][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:09,405][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:09,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:10,055][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:10,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:10,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:11,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:11,358][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:12,331][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:12,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:13,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:14,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:14,088][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:14,090][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:15,084][__main__][INFO] - Iteration 136 took 21s (35.12% Gen, 60.29% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 17m 22s. Estimated total time: 18h 4m 48s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 9s, 500 more iterations: 3h 0m 48s.
[2025-11-13 08:51:15,086][__main__][INFO] - Starting iteration 136.
[2025-11-13 08:51:15,089][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:15,089][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:22,711][__main__][INFO] - Number of regex retries in iteration 136: 0
[2025-11-13 08:51:22,712][__main__][INFO] - agents played in iteration 136 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:51:23,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:23,275][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:23,998][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:24,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:24,620][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:24,944][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:25,596][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:25,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:26,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:26,570][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:26,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:27,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:27,868][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:28,516][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:28,844][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:29,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:30,143][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:30,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:30,790][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:31,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:31,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:31,762][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:32,417][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:32,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:33,072][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:33,398][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:33,723][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:34,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:34,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:35,082][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:35,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:35,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:35,811][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:36,776][__main__][INFO] - Iteration 137 took 21s (35.15% Gen, 60.40% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 36s. Estimated total time: 18h 4m 23s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 8s, 500 more iterations: 3h 0m 43s.
[2025-11-13 08:51:36,778][__main__][INFO] - Starting iteration 137.
[2025-11-13 08:51:36,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:36,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:51:44,321][__main__][INFO] - Number of regex retries in iteration 137: 0 [2025-11-13 08:51:44,322][__main__][INFO] - agents played in iteration 137 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:51:44,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:44,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:44,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:44,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:44,868][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:51:44,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:51:45,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:45,888][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:46,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:46,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:46,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:47,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:47,512][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:48,487][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:48,811][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:49,137][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:49,461][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:50,112][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:50,437][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:50,761][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:51,087][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:51,411][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:51,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:52,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:52,392][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:52,716][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:53,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:53,689][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:54,014][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:54,987][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:55,312][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:55,636][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:55,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:56,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:57,405][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:57,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:57,408][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:58,385][__main__][INFO] - Iteration 138 took 21s (34.90% Gen, 60.57% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 12m 3s. Estimated total time: 18h 0m 12s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 0s, 500 more iterations: 3h 0m 2s.
[2025-11-13 08:51:58,387][__main__][INFO] - Starting iteration 138.
[2025-11-13 08:51:58,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:58,390][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:05,703][__main__][INFO] - Number of regex retries in iteration 138: 0
[2025-11-13 08:52:05,703][__main__][INFO] - agents played in iteration 138 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:52:06,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:06,270][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:07,288][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:07,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:08,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:08,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:09,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:09,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:09,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:10,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:10,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:11,515][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:12,163][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:12,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:12,811][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:13,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:13,786][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:14,436][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:14,760][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:15,085][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:15,411][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:16,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:16,384][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:16,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:17,032][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:17,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:18,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:18,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:18,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:18,793][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:19,779][__main__][INFO] - Iteration 139 took 21s (34.19% Gen, 61.20% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 1m 0s. Estimated total time: 17h 49m 31s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 39s, 500 more iterations: 2h 58m 15s.
[2025-11-13 08:52:19,781][__main__][INFO] - Starting iteration 139.
[2025-11-13 08:52:19,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:19,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:27,481][__main__][INFO] - Number of regex retries in iteration 139: 0
[2025-11-13 08:52:27,482][__main__][INFO] - agents played in iteration 139 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:52:27,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:27,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:27,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,032][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:28,032][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:28,757][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:29,098][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:30,416][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:31,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:32,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:33,351][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:34,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:34,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:34,662][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:34,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:35,965][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:36,296][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:36,946][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:37,596][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:37,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:38,246][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:38,894][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:39,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:39,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:40,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:40,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:40,681][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:41,662][__main__][INFO] - Iteration 140 took 21s (35.18% Gen, 60.33% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 25m 1s. Estimated total time: 18h 13m 53s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 27s, 500 more iterations: 3h 2m 18s.
[2025-11-13 08:52:41,664][__main__][INFO] - Starting iteration 140.
[2025-11-13 08:52:41,668][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:41,668][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:49,049][__main__][INFO] - Number of regex retries in iteration 140: 0
[2025-11-13 08:52:49,050][__main__][INFO] - agents played in iteration 140 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:52:49,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:49,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:50,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:51,589][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:51,916][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:52,241][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:53,216][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:53,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:54,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:55,164][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:55,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:56,138][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:57,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:58,412][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:58,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:59,062][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:59,712][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:00,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:00,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:01,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:02,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:02,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:02,127][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:04,038][__main__][INFO] - Iteration 141 took 22s (33.00% Gen, 58.46% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 49m 18s. Estimated total time: 18h 38m 33s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 17s, 500 more iterations: 3h 6m 25s.
[2025-11-13 08:53:04,040][__main__][INFO] - Starting iteration 141.
[2025-11-13 08:53:04,044][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:04,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:12,064][__main__][INFO] - Number of regex retries in iteration 141: 0
[2025-11-13 08:53:12,065][__main__][INFO] - agents played in iteration 141 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:53:12,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,608][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:12,608][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:13,331][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:13,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:14,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:14,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:14,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:15,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:15,912][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:16,564][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:17,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:17,537][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:17,862][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:18,188][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:18,514][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:19,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:19,487][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:20,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:20,788][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:21,114][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:21,439][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:21,766][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:22,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:22,418][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:23,393][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:23,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:24,429][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:25,159][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:25,160][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:25,162][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:26,179][__main__][INFO] - Iteration 142 took 22s (36.23% Gen, 59.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 12s. Estimated total time: 18h 26m 49s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 53s, 500 more iterations: 3h 4m 28s.
[2025-11-13 08:53:26,182][__main__][INFO] - Starting iteration 142.
[2025-11-13 08:53:26,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:26,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:33,609][__main__][INFO] - Number of regex retries in iteration 142: 0
[2025-11-13 08:53:33,610][__main__][INFO] - agents played in iteration 142 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:53:34,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,155][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:34,155][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:34,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:35,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:36,484][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:37,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:37,793][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:38,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:39,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:39,756][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:40,081][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:40,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:40,730][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:41,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:42,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:42,366][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:42,693][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:43,017][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:43,347][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:43,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:44,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:44,650][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:44,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:45,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:46,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:46,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:46,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:46,761][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:47,736][__main__][INFO] - Iteration 143 took 21s (34.45% Gen, 61.02% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 35s. Estimated total time: 17h 57m 34s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 55s, 500 more iterations: 2h 59m 35s.
[2025-11-13 08:53:47,738][__main__][INFO] - Starting iteration 143.
[2025-11-13 08:53:47,741][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:47,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:55,509][__main__][INFO] - Number of regex retries in iteration 143: 0
[2025-11-13 08:53:55,510][__main__][INFO] - agents played in iteration 143 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:53:55,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:55,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:56,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:56,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:57,099][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:57,424][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:57,747][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:58,071][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:58,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:59,043][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:59,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:00,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:00,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:00,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:01,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:01,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:01,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:02,288][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:02,613][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:02,939][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:03,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:03,587][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:03,913][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:04,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:04,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:04,896][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:05,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:05,550][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:05,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:06,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:06,847][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:07,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:07,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:08,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:08,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:08,647][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:09,619][__main__][INFO] - Iteration 144 took 21s (35.50% Gen, 60.05% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 23m 34s. Estimated total time: 18h 13m 55s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 27s, 500 more iterations: 3h 2m 19s.
[2025-11-13 08:54:09,621][__main__][INFO] - Starting iteration 144.
[2025-11-13 08:54:09,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:09,624][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:17,491][__main__][INFO] - Number of regex retries in iteration 144: 0
[2025-11-13 08:54:17,492][__main__][INFO] - agents played in iteration 144 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:54:17,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:17,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:18,043][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:19,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:19,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:20,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:20,355][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:20,678][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:21,002][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:21,326][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:21,652][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:21,976][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:22,625][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:22,951][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:23,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:23,602][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:23,927][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:24,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:24,576][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:24,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:25,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:25,554][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:25,879][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:26,208][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:27,189][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:27,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:27,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:28,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:28,497][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:28,822][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:29,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:29,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:30,590][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:30,592][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:30,593][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:31,589][__main__][INFO] - Iteration 145 took 21s (35.82% Gen, 59.65% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 34s. Estimated total time: 18h 18m 16s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 36s, 500 more iterations: 3h 3m 2s.
[2025-11-13 08:54:31,591][__main__][INFO] - Starting iteration 145.
[2025-11-13 08:54:31,594][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:31,595][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:39,523][__main__][INFO] - Number of regex retries in iteration 145: 0
[2025-11-13 08:54:39,524][__main__][INFO] - agents played in iteration 145 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:54:39,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,069][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:40,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:40,793][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:41,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:42,065][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:42,390][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:42,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:43,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:43,689][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:44,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:44,339][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:44,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:44,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:45,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:45,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:45,960][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:46,291][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:46,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:46,938][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:47,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:48,237][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:48,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:49,211][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:49,534][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:49,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:50,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:50,506][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:50,831][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:51,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:51,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:52,598][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:52,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:52,602][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:53,574][__main__][INFO] - Iteration 146 took 21s (36.07% Gen, 59.50% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 58s. Estimated total time: 18h 19m 2s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 38s, 500 more iterations: 3h 3m 10s.
[2025-11-13 08:54:53,576][__main__][INFO] - Starting iteration 146.
[2025-11-13 08:54:53,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:53,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:01,280][__main__][INFO] - Number of regex retries in iteration 146: 0
[2025-11-13 08:55:01,281][__main__][INFO] - agents played in iteration 146 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:55:01,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:01,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:01,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:01,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:01,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:01,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:02,558][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:02,855][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:03,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:04,158][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:04,488][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:04,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:05,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:05,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:06,122][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:06,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:06,774][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:07,097][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:07,754][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:08,406][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:08,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:09,054][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:09,379][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:09,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:10,032][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:10,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:11,006][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:11,332][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:12,310][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:12,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:13,652][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:14,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:14,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:14,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:15,362][__main__][INFO] - Iteration 147 took 21s (35.35% Gen, 60.18% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 17m 44s. Estimated total time: 18h 9m 10s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 18s, 500 more iterations: 3h 1m 31s.
[2025-11-13 08:55:15,365][__main__][INFO] - Starting iteration 147.
[2025-11-13 08:55:15,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:15,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:23,088][__main__][INFO] - Number of regex retries in iteration 147: 0
[2025-11-13 08:55:23,089][__main__][INFO] - agents played in iteration 147 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:55:23,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:23,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:23,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:23,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:23,637][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:23,638][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:24,357][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:24,654][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:26,282][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:26,607][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:26,931][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:27,580][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:27,905][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:28,229][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:28,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:28,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:29,525][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:29,849][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:30,174][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:30,825][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:31,149][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:32,120][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:33,092][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:33,419][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:33,742][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:34,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:34,393][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:34,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:35,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:36,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:36,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:36,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:37,119][__main__][INFO] - Iteration 148 took 21s (35.49% Gen, 59.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 47s. Estimated total time: 18h 7m 35s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 15s, 500 more iterations: 3h 1m 15s.
[2025-11-13 08:55:37,121][__main__][INFO] - Starting iteration 148.
[2025-11-13 08:55:37,124][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:37,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:45,017][__main__][INFO] - Number of regex retries in iteration 148: 0
[2025-11-13 08:55:45,018][__main__][INFO] - agents played in iteration 148 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:55:45,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:45,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:45,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:45,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:45,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:45,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:46,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:46,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:46,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:47,557][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:47,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:48,205][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:48,855][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:49,827][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:50,476][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:51,772][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:52,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:53,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:53,392][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:53,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:54,039][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:55,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:55,338][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:55,987][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:56,311][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:56,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:57,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:58,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:58,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:58,041][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:59,017][__main__][INFO] - Iteration 149 took 21s (36.05% Gen, 59.48% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 32s. Estimated total time: 18h 14m 42s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 29s, 500 more iterations: 3h 2m 27s.
[2025-11-13 08:55:59,020][__main__][INFO] - Starting iteration 149.
[2025-11-13 08:55:59,023][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:59,024][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:06,953][__main__][INFO] - Number of regex retries in iteration 149: 0
[2025-11-13 08:56:06,954][__main__][INFO] - agents played in iteration 149 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:56:07,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,499][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:07,500][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:08,226][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:08,523][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:09,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:10,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:10,471][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:10,796][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:12,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:12,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:12,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:13,064][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:13,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:14,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:14,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:56:15,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:56:15,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:56:15,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:56:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:56:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:56:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:56:16,956][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:56:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:56:17,605][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:56:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:56:18,252][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:56:18,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:19,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:56:20,028][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:56:20,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:56:20,032][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:56:20,993][__main__][INFO] - Iteration 150 took 21s (36.09% Gen, 59.53% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 26m 1s. Estimated total time: 18h 18m 33s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 37s, 500 more iterations: 3h 3m 5s.
[2025-11-13 08:56:20,995][__main__][INFO] - Starting iteration 150.
[2025-11-13 08:56:20,998][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:56:20,999][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:28,438][__main__][INFO] - Number of regex retries in iteration 150: 0
[2025-11-13 08:56:28,439][__main__][INFO] - agents played in iteration 150 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:56:28,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:28,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:28,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:28,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:28,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:28,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:29,723][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:30,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:30,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:31,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:31,331][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:31,982][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:32,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:32,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:32,956][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:33,280][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:33,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:33,930][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:34,255][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:34,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:35,229][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:35,553][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:35,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:36,204][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:56:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:56:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:56:37,181][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:56:37,507][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:56:37,831][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:56:38,158][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:56:38,484][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:56:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:56:39,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:56:39,456][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:56:39,781][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:56:40,105][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:40,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:56:41,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:56:41,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:56:41,504][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:56:43,396][__main__][INFO] - Iteration 151 took 22s (33.21% Gen, 58.33% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 47m 2s. Estimated total time: 18h 39m 56s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 19s, 500 more iterations: 3h 6m 39s.
[2025-11-13 08:56:43,399][__main__][INFO] - Starting iteration 151.
[2025-11-13 08:56:43,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:56:43,403][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:51,792][__main__][INFO] - Number of regex retries in iteration 151: 0
[2025-11-13 08:56:51,792][__main__][INFO] - agents played in iteration 151 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:56:52,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:52,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:52,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:52,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:52,330][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:52,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:53,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:53,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:53,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:54,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:54,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:55,298][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:55,628][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:55,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:56,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:56,619][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:56,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:57,278][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:57,927][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:58,253][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:58,901][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:59,226][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:56:59,875][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:00,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:00,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:00,854][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:01,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:01,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:02,478][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:03,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:04,118][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:04,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:04,839][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:04,840][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:05,812][__main__][INFO] - Iteration 152 took 22s (37.44% Gen, 58.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 47m 15s. Estimated total time: 18h 40m 31s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 21s, 500 more iterations: 3h 6m 45s.
[2025-11-13 08:57:05,814][__main__][INFO] - Starting iteration 152.
[2025-11-13 08:57:05,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:05,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:13,549][__main__][INFO] - Number of regex retries in iteration 152: 0
[2025-11-13 08:57:13,550][__main__][INFO] - agents played in iteration 152 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:57:13,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:14,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:14,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:15,097][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:57:15,423][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:57:15,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:57:16,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:57:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:57:16,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:57:17,051][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:57:17,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:57:17,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:57:18,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:57:18,350][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:57:18,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:57:19,001][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:57:19,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:57:19,652][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:57:19,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:57:20,300][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:57:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:57:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:21,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:21,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:22,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:22,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:23,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:23,560][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:23,885][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:24,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:24,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:25,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:25,883][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:26,611][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:26,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:26,614][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:27,597][__main__][INFO] - Iteration 153 took 21s (35.50% Gen, 59.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 23s. Estimated total time: 18h 9m 1s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 18s, 500 more iterations: 3h 1m 30s.
[2025-11-13 08:57:27,599][__main__][INFO] - Starting iteration 153.
[2025-11-13 08:57:27,602][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:27,603][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:35,264][__main__][INFO] - Number of regex retries in iteration 153: 0
[2025-11-13 08:57:35,265][__main__][INFO] - agents played in iteration 153 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:57:35,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:35,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:35,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:35,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:35,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:35,798][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:36,500][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:57:37,125][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:57:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:57:37,784][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:57:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:57:38,443][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:57:38,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:57:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:57:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:57:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:57:40,076][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:57:40,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:57:40,727][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:57:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:57:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:57:41,713][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:57:42,039][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:57:42,363][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:57:42,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:43,340][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:43,664][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:43,993][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:44,316][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:44,965][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:45,615][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:46,265][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:46,590][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:46,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
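The "Processing mini-batch k of 128" run followed by a single "Accumulated the policy gradient loss for 3840 tokens." entry is the usual gradient-accumulation pattern: per-mini-batch losses are accumulated and one optimizer step is applied afterwards. A minimal sketch, assuming illustrative names (`grad_fn`, `apply_step` are not from the mllm codebase):

```python
# Hedged sketch of gradient accumulation over mini-batches, as suggested by
# the log's "Processing mini-batch ... / Accumulated the policy gradient loss"
# pattern. grad_fn and apply_step are hypothetical stand-ins.

def train_batch(minibatches, grad_fn, apply_step, log_every=4):
    """Accumulate gradients over all mini-batches, then take one step."""
    total_grad = 0.0
    total_tokens = 0
    for i, mb in enumerate(minibatches):
        if i % log_every == 0:  # the log reports every 4th mini-batch
            print(f"Processing mini-batch {i} of {len(minibatches)}")
        g, n_tokens = grad_fn(mb)  # per-mini-batch gradient and token count
        total_grad += g
        total_tokens += n_tokens
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    apply_step(total_grad / len(minibatches))  # one step per full batch
    return total_tokens
```

With 128 mini-batches of 30 trained tokens each this reproduces the 3840-token total seen in the log.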
[2025-11-13 08:57:47,585][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:48,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:48,311][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:48,313][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:49,276][__main__][INFO] - Iteration 154 took 21s (35.35% Gen, 60.20% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 9m 44s. Estimated total time: 18h 3m 44s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 7s, 500 more iterations: 3h 0m 37s.
[2025-11-13 08:57:49,278][__main__][INFO] - Starting iteration 154.
[2025-11-13 08:57:49,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
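The "Estimated remaining time ... 10 more iterations ..." entries can be produced by projecting the mean per-iteration wall time forward. A minimal sketch under that assumption (all names are illustrative, not the mllm implementation):

```python
# Hedged sketch: ETA projection from average per-iteration wall time,
# matching the shape of the log's "Estimated remaining time" entries.

def eta_estimates(avg_iter_seconds, iters_done, iters_total):
    """Project remaining/total wall time and a few fixed horizons."""
    def fmt(s):
        s = int(s)
        h, rem = divmod(s, 3600)
        m, sec = divmod(rem, 60)
        return f"{h}h {m}m {sec}s" if h else f"{m}m {sec}s"

    return {
        "remaining": fmt(avg_iter_seconds * (iters_total - iters_done)),
        "total": fmt(avg_iter_seconds * iters_total),
        "next_10": fmt(avg_iter_seconds * 10),
        "next_100": fmt(avg_iter_seconds * 100),
        "next_500": fmt(avg_iter_seconds * 500),
    }
```

For example, at ~21.7 s per iteration the 100-iteration horizon comes out near the "36m 7s" the log reports.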
[2025-11-13 08:57:49,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:57,084][__main__][INFO] - Number of regex retries in iteration 154: 0
[2025-11-13 08:57:57,085][__main__][INFO] - agents played in iteration 154 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:57:57,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:57,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:57,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:57,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:57,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:57,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:58,307][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:57:58,929][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:57:59,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:57:59,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:57:59,907][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:00,233][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:00,557][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:00,886][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:01,536][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:02,185][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:02,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:03,486][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:03,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:04,136][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:04,785][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:05,110][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:06,406][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:06,730][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:07,383][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:07,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:08,034][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:08,358][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:08,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:09,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:10,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:10,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:10,066][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:11,056][__main__][INFO] - Iteration 155 took 21s (35.83% Gen, 59.61% Train). Generation: 7s, Training: 12s. Estimated remaining time: 17h 14m 25s. Estimated total time: 18h 8m 47s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 17s, 500 more iterations: 3h 1m 27s.
[2025-11-13 08:58:11,058][__main__][INFO] - Starting iteration 155.
[2025-11-13 08:58:11,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:11,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:18,709][__main__][INFO] - Number of regex retries in iteration 155: 0
[2025-11-13 08:58:18,709][__main__][INFO] - agents played in iteration 155 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:58:19,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:19,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:19,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:19,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:19,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:19,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
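The repeated "Get advantages with critic gradient accumulation" entries precede the policy update; a common way such advantages are formed is discounted return minus a critic baseline. This is only a sketch of that generic estimator, not necessarily the advantage-alignment estimator trainer_ad_align actually uses:

```python
# Hedged sketch: one common advantage estimator (discounted return minus a
# critic baseline). The actual mllm estimator may differ.

def advantages(rewards, values, gamma=0.99):
    """Per-timestep advantage A_t = G_t - V(s_t), with G_t the discounted return."""
    G, out = 0.0, []
    for r, v in zip(reversed(rewards), reversed(values)):
        G = r + gamma * G          # accumulate return backwards in time
        out.append(G - v)          # subtract the critic's value estimate
    return out[::-1]               # restore chronological order
```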
[2025-11-13 08:58:19,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:20,228][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:20,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:21,204][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:21,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:21,855][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:22,830][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:23,155][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:23,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:24,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:24,452][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:26,072][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:26,401][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:26,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:27,049][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:27,373][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:27,703][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:28,026][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:28,350][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:28,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:29,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:29,325][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:30,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:30,952][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:31,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:31,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:31,678][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:32,624][__main__][INFO] - Iteration 156 took 21s (35.46% Gen, 60.14% Train). Generation: 7s, Training: 12s. Estimated remaining time: 17h 3m 26s. Estimated total time: 17h 58m 9s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 56s, 500 more iterations: 2h 59m 41s.
[2025-11-13 08:58:32,626][__main__][INFO] - Starting iteration 156.
[2025-11-13 08:58:32,630][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:32,630][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:39,754][__main__][INFO] - Number of regex retries in iteration 156: 0
[2025-11-13 08:58:39,755][__main__][INFO] - agents played in iteration 156 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:58:40,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:40,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:40,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:40,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:40,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:40,294][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:41,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:41,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:41,942][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:42,270][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:42,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:43,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:43,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:43,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:44,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:44,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:45,208][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:45,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:45,859][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:46,831][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:47,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:47,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:47,803][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:48,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:48,451][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:48,776][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:49,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:49,424][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:50,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:50,720][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:51,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:52,037][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:52,763][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:52,764][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:52,766][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:53,708][__main__][INFO] - Iteration 157 took 21s (33.80% Gen, 61.72% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 38m 53s. Estimated total time: 17h 33m 57s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 7s, 500 more iterations: 2h 55m 39s.
[2025-11-13 08:58:53,710][__main__][INFO] - Starting iteration 157.
[2025-11-13 08:58:53,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
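When post-processing a run like this one, the per-iteration timing entries are easy to extract with a regular expression. A minimal sketch (the parser is mine, not part of the codebase; it targets the exact entry format shown above):

```python
# Hedged sketch: pull iteration number, duration, and the Gen/Train split
# out of the "__main__" timing entries in this log.
import re

ITER_RE = re.compile(
    r"Iteration (\d+) took (\d+)s \(([\d.]+)% Gen, ([\d.]+)% Train\)"
)

def parse_iteration_line(line):
    """Return a dict of timing fields, or None if the line is not a timing entry."""
    m = ITER_RE.search(line)
    if not m:
        return None
    it, secs, gen, train = m.groups()
    return {"iteration": int(it), "seconds": int(secs),
            "gen_pct": float(gen), "train_pct": float(train)}
```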
[2025-11-13 08:58:53,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:00,761][__main__][INFO] - Number of regex retries in iteration 157: 0
[2025-11-13 08:59:00,761][__main__][INFO] - agents played in iteration 157 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:59:01,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:01,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:01,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:01,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:01,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:01,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:01,996][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:02,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:02,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:03,266][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:03,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:03,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:04,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:04,572][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:04,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:05,231][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:05,559][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:05,886][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:06,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:06,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:07,190][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:07,515][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:07,841][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:08,166][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:08,492][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:08,823][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:09,148][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:09,472][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:09,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:10,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:10,774][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:11,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:11,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:12,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:13,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:13,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:13,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:13,822][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:14,767][__main__][INFO] - Iteration 158 took 21s (33.47% Gen, 62.03% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 37m 18s. Estimated total time: 17h 32m 44s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 5s, 500 more iterations: 2h 55m 27s.
[2025-11-13 08:59:14,769][__main__][INFO] - Starting iteration 158.
[2025-11-13 08:59:14,773][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:14,774][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:21,695][__main__][INFO] - Number of regex retries in iteration 158: 0
[2025-11-13 08:59:21,696][__main__][INFO] - agents played in iteration 158 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 08:59:22,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:22,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:22,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:22,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:22,234][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:22,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:22,939][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:23,235][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:23,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:23,884][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:24,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:24,533][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:24,859][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:25,184][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:26,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:26,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:26,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:27,148][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:27,473][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:27,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:28,123][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:28,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:29,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:29,422][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:30,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:31,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:31,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:32,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:32,681][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:33,005][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:33,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:33,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:34,710][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:34,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:34,713][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:35,675][__main__][INFO] - Iteration 159 took 20s (33.11% Gen, 62.28% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 29m 22s. Estimated total time: 17h 25m 8s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 50s, 500 more iterations: 2h 54m 11s.
[2025-11-13 08:59:35,677][__main__][INFO] - Starting iteration 159.
[2025-11-13 08:59:35,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:35,682][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:59:42,369][__main__][INFO] - Number of regex retries in iteration 159: 0 [2025-11-13 08:59:42,370][__main__][INFO] - agents played in iteration 159 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 08:59:42,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:42,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:42,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:42,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:42,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:59:42,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:59:43,634][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:59:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:59:44,257][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:59:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:59:44,909][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:59:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:59:45,561][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:59:45,887][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:59:46,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:59:46,537][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:59:46,863][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:59:47,190][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:59:47,516][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:59:47,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:59:48,170][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:59:48,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:59:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:59:49,154][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:59:49,478][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:59:49,802][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:59:50,127][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:59:50,452][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:59:50,777][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:59:51,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:59:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:59:51,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:59:52,082][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:59:52,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:59:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:59:53,060][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:59:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:59:53,708][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:59:54,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:59:54,701][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:55,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:55,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:55,437][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:56,400][__main__][INFO] - Iteration 160 took 20s (32.28% Gen, 63.07% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 19m 52s. Estimated total time: 17h 15m 59s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 31s, 500 more iterations: 2h 52m 39s.
[2025-11-13 08:59:56,403][__main__][INFO] - Starting iteration 160.
[2025-11-13 08:59:56,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:56,406][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:02,876][__main__][INFO] - Number of regex retries in iteration 160: 0
[2025-11-13 09:00:02,876][__main__][INFO] - agents played in iteration 160 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:00:03,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:03,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:03,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:03,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:03,761][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:03,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:04,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:04,766][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:05,418][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:05,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:06,070][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:06,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:08,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:08,365][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:08,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:09,018][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:09,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:10,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:10,329][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:10,656][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:10,985][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:11,315][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:11,967][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:12,296][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:12,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:13,271][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:14,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:14,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:15,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:16,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:16,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:16,301][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:18,189][__main__][INFO] - Iteration 161 took 21s (29.70% Gen, 61.63% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 12m 44s. Estimated total time: 18h 9m 13s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 18s, 500 more iterations: 3h 1m 32s.
[2025-11-13 09:00:18,191][__main__][INFO] - Starting iteration 161.
[2025-11-13 09:00:18,195][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:18,195][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:25,814][__main__][INFO] - Number of regex retries in iteration 161: 0
[2025-11-13 09:00:25,815][__main__][INFO] - agents played in iteration 161 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:00:26,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:26,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:26,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:26,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:26,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:26,353][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:27,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:27,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:27,690][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:28,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:28,664][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:28,989][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:29,316][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:29,643][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:29,967][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:30,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:30,618][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:30,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:31,271][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:31,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:32,251][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:33,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:33,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:34,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:34,540][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:34,866][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:35,519][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:35,843][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:36,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:36,502][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:36,828][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:37,151][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:37,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:38,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:38,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:38,861][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:38,863][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:39,848][__main__][INFO] - Iteration 162 took 21s (35.19% Gen, 60.26% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 5m 51s. Estimated total time: 18h 2m 41s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 5s, 500 more iterations: 3h 0m 26s.
[2025-11-13 09:00:39,850][__main__][INFO] - Starting iteration 162.
[2025-11-13 09:00:39,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:39,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:46,622][__main__][INFO] - Number of regex retries in iteration 162: 0
[2025-11-13 09:00:46,622][__main__][INFO] - agents played in iteration 162 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:00:47,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:47,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:47,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:47,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:47,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:47,170][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:47,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:48,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:48,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:48,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:49,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:49,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:49,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:50,131][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:50,457][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:50,782][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:51,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:51,762][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:52,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:52,412][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:52,736][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:53,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:54,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:54,368][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:54,693][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:55,019][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:55,347][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:55,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:56,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:56,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:56,974][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:57,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:57,625][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:58,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:58,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:59,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:59,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:59,663][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:00,628][__main__][INFO] - Iteration 163 took 20s (32.58% Gen, 62.77% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 21m 32s. Estimated total time: 17h 18m 43s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 37s, 500 more iterations: 2h 53m 7s.
[2025-11-13 09:01:00,630][__main__][INFO] - Starting iteration 163.
[2025-11-13 09:01:00,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:00,634][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:07,829][__main__][INFO] - Number of regex retries in iteration 163: 0
[2025-11-13 09:01:07,830][__main__][INFO] - agents played in iteration 163 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:01:08,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:08,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:08,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:08,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:08,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:08,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:09,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:09,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:10,049][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:10,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:10,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:11,676][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:12,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:12,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:12,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:13,305][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:13,629][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:14,280][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:14,605][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:15,255][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:15,580][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:15,909][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:16,562][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:17,214][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:17,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:17,872][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:18,201][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:18,851][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:19,175][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:19,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:20,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:20,880][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:20,881][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:20,883][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:21,834][__main__][INFO] - Iteration 164 took 21s (33.94% Gen, 61.57% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 42m 32s. Estimated total time: 17h 40m 4s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 20s, 500 more iterations: 2h 56m 40s.
[2025-11-13 09:01:21,836][__main__][INFO] - Starting iteration 164.
[2025-11-13 09:01:21,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:21,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:28,779][__main__][INFO] - Number of regex retries in iteration 164: 0
[2025-11-13 09:01:28,779][__main__][INFO] - agents played in iteration 164 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:01:29,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:29,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:29,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:29,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:29,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:29,329][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:30,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:31,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:31,997][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:32,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:32,972][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:33,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:33,622][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:33,947][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:34,271][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:34,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:35,249][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:35,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:36,563][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:36,890][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:37,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:37,540][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:38,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:38,522][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:38,851][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:39,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:39,836][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:40,161][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:40,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:41,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:41,873][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:41,875][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:41,876][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:42,824][__main__][INFO] - Iteration 165 took 20s (33.07% Gen, 62.41% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 31m 22s. Estimated total time: 17h 29m 16s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 58s, 500 more iterations: 2h 54m 52s.
[2025-11-13 09:01:42,826][__main__][INFO] - Starting iteration 165.
[2025-11-13 09:01:42,830][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:42,830][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:49,948][__main__][INFO] - Number of regex retries in iteration 165: 0
[2025-11-13 09:01:49,949][__main__][INFO] - agents played in iteration 165 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:01:50,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:50,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:50,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:50,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:50,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:50,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:51,511][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:52,815][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:53,141][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:53,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:54,115][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:54,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:54,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:55,090][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:55,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:55,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:57,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:57,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:58,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:58,664][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:59,319][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:59,645][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:59,973][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:00,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:01,293][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:01,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:02,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:03,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:03,008][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:03,010][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:03,986][__main__][INFO] - Iteration 166 took 21s (33.65% Gen, 61.73% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 39m 37s. Estimated total time: 17h 37m 52s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 15s, 500 more iterations: 2h 56m 18s.
[2025-11-13 09:02:03,988][__main__][INFO] - Starting iteration 166.
[2025-11-13 09:02:03,991][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:03,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:11,085][__main__][INFO] - Number of regex retries in iteration 166: 0
[2025-11-13 09:02:11,086][__main__][INFO] - agents played in iteration 166 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:02:11,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:11,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:11,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:11,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:11,605][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:11,605][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:12,330][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:12,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:13,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:13,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:14,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:14,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:15,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:15,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:16,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:16,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:16,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:17,522][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:17,847][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:18,173][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:18,499][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:19,803][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:20,129][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:20,454][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:20,779][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:21,104][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:21,430][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:21,759][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:22,087][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:22,414][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:22,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:23,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:24,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:24,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:24,179][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:25,133][__main__][INFO] - Iteration 167 took 21s (33.55% Gen, 61.93% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 38m 32s. Estimated total time: 17h 37m 8s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 14s, 500 more iterations: 2h 56m 11s.
[2025-11-13 09:02:25,135][__main__][INFO] - Starting iteration 167.
[2025-11-13 09:02:25,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:25,139][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:32,265][__main__][INFO] - Number of regex retries in iteration 167: 0
[2025-11-13 09:02:32,265][__main__][INFO] - agents played in iteration 167 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:02:32,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:32,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:32,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:32,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:32,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:32,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:33,454][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:33,750][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:34,076][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:34,403][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:34,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:35,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:36,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:36,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:37,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:37,331][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:37,655][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:38,304][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:39,603][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:39,928][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:40,253][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:40,579][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:41,230][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:41,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:42,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:42,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:42,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:43,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:43,510][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:43,836][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:44,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:45,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:45,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:45,286][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:46,251][__main__][INFO] - Iteration 168 took 21s (33.75% Gen, 61.67% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 36m 44s. Estimated total time: 17h 35m 41s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 11s, 500 more iterations: 2h 55m 56s.
[2025-11-13 09:02:46,253][__main__][INFO] - Starting iteration 168.
[2025-11-13 09:02:46,257][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:46,257][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:53,362][__main__][INFO] - Number of regex retries in iteration 168: 0
[2025-11-13 09:02:53,363][__main__][INFO] - agents played in iteration 168 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:02:53,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:53,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:53,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:53,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:53,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:53,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:54,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:55,194][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:55,519][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:56,170][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:56,497][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:56,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:57,154][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:57,481][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:58,136][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:58,465][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:58,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:59,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:59,764][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:00,089][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:00,414][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:00,740][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:01,064][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:01,387][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:01,715][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:02,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:02,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:02,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:03,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:03,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:03,997][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:04,323][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:04,650][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:04,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:05,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:06,383][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:06,384][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:06,386][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:07,370][__main__][INFO] - Iteration 169 took 21s (33.65% Gen, 61.68% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 36m 25s. Estimated total time: 17h 35m 43s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 11s, 500 more iterations: 2h 55m 57s.
[2025-11-13 09:03:07,373][__main__][INFO] - Starting iteration 169.
[2025-11-13 09:03:07,376][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:07,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:14,310][__main__][INFO] - Number of regex retries in iteration 169: 0
[2025-11-13 09:03:14,310][__main__][INFO] - agents played in iteration 169 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:03:14,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:14,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:14,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:14,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:14,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:14,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:16,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:16,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:16,787][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:17,437][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:17,760][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:18,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:18,739][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:19,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:19,725][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:20,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:20,704][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:21,029][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:21,353][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:21,677][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:22,328][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:22,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:23,301][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:23,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:23,951][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:24,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:24,927][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:25,254][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:25,579][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:25,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:26,615][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:27,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:27,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:27,352][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:28,351][__main__][INFO] - Iteration 170 took 20s (33.06% Gen, 62.17% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 29m 10s. Estimated total time: 17h 28m 49s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 57s, 500 more iterations: 2h 54m 48s.
[2025-11-13 09:03:28,353][__main__][INFO] - Starting iteration 170.
[2025-11-13 09:03:28,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:28,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:03:35,308][__main__][INFO] - Number of regex retries in iteration 170: 0 [2025-11-13 09:03:35,309][__main__][INFO] - agents played in iteration 170 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:03:35,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:35,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:35,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:35,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:35,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:03:35,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:03:36,508][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:37,128][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:37,779][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:38,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:38,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:39,076][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:39,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:39,724][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:40,049][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:40,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:40,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:41,025][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:41,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:41,681][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:42,333][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:42,984][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:43,309][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:43,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:43,959][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:44,609][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:44,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:45,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:45,907][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:46,233][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:46,557][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:46,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:47,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:48,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:48,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:48,304][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:50,164][__main__][INFO] - Iteration 171 took 21s (31.88% Gen, 59.59% Train). Generation: 6s, Training: 12s. Estimated remaining time: 17h 10m 23s. Estimated total time: 18h 10m 24s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 20s, 500 more iterations: 3h 1m 44s.
[2025-11-13 09:03:50,166][__main__][INFO] - Starting iteration 171.
[2025-11-13 09:03:50,169][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:03:50,170][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:57,940][__main__][INFO] - Number of regex retries in iteration 171: 0
[2025-11-13 09:03:57,941][__main__][INFO] - agents played in iteration 171 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:03:58,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:58,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:58,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:58,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:58,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:58,460][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:59,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:59,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:00,091][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:00,416][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:00,741][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:01,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:01,389][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:02,037][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:02,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:02,687][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:03,339][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:03,663][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:04,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:04,966][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:05,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:05,941][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:06,590][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:06,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:07,241][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:07,565][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:07,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:08,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:08,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:09,189][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:09,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:10,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:10,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:10,962][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:10,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:11,936][__main__][INFO] - Iteration 172 took 21s (35.70% Gen, 59.84% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 8m 0s. Estimated total time: 18h 8m 22s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 16s, 500 more iterations: 3h 1m 23s.
[2025-11-13 09:04:11,938][__main__][INFO] - Starting iteration 172.
[2025-11-13 09:04:11,940][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:11,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:19,341][__main__][INFO] - Number of regex retries in iteration 172: 0
[2025-11-13 09:04:19,342][__main__][INFO] - agents played in iteration 172 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:04:19,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:19,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:19,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:19,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:19,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:19,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:20,840][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:21,165][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:21,489][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:21,814][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:22,138][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:22,463][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:22,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:23,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:23,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:24,085][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:24,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:25,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:26,036][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:26,692][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:27,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:27,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:29,301][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:29,625][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:29,950][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:30,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:30,599][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:30,923][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:31,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:32,337][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:32,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:32,341][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:33,281][__main__][INFO] - Iteration 173 took 21s (34.67% Gen, 60.91% Train). Generation: 7s, Training: 12s. Estimated remaining time: 16h 46m 19s. Estimated total time: 17h 47m 3s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 34s, 500 more iterations: 2h 57m 50s.
[2025-11-13 09:04:33,283][__main__][INFO] - Starting iteration 173.
[2025-11-13 09:04:33,286][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:33,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:40,724][__main__][INFO] - Number of regex retries in iteration 173: 0
[2025-11-13 09:04:40,725][__main__][INFO] - agents played in iteration 173 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:04:41,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:41,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:41,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:41,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:41,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:41,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:41,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:42,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:42,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:42,863][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:43,837][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:44,821][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:45,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:45,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:46,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:46,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:47,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:47,421][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:48,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:48,403][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:48,728][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:49,053][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:49,381][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:49,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:50,031][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:50,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:51,005][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:51,330][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:51,655][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:51,979][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:52,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:53,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:53,703][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:53,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:53,707][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:54,650][__main__][INFO] - Iteration 174 took 21s (34.81% Gen, 60.76% Train). Generation: 7s, Training: 12s. Estimated remaining time: 16h 47m 9s. Estimated total time: 17h 48m 15s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 36s, 500 more iterations: 2h 58m 2s.
[2025-11-13 09:04:54,653][__main__][INFO] - Starting iteration 174.
[2025-11-13 09:04:54,656][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:54,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:01,815][__main__][INFO] - Number of regex retries in iteration 174: 0
[2025-11-13 09:05:01,816][__main__][INFO] - agents played in iteration 174 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:05:02,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:02,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:02,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:02,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:02,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:02,410][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:03,081][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:03,376][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:03,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:04,349][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:04,677][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:05,325][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:05,650][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:06,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:06,945][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:07,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:07,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:08,565][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:09,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:09,535][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:09,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:10,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:10,513][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:11,825][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:12,475][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:12,800][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:13,124][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:13,449][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:14,152][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:14,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:14,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:14,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:15,736][__main__][INFO] - Iteration 175 took 21s (33.96% Gen, 61.85% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 32m 36s. Estimated total time: 17h 34m 2s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 8s, 500 more iterations: 2h 55m 40s.
[2025-11-13 09:05:15,738][__main__][INFO] - Starting iteration 175.
[2025-11-13 09:05:15,741][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:15,741][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:22,818][__main__][INFO] - Number of regex retries in iteration 175: 0
[2025-11-13 09:05:22,819][__main__][INFO] - agents played in iteration 175 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:05:23,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:23,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:23,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:23,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:23,341][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:23,341][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:24,013][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:24,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:24,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:24,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:25,286][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:25,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:25,934][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:26,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:26,589][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:27,569][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:27,898][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:28,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:28,873][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:29,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:29,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:29,848][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:30,173][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:30,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:31,144][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:31,468][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:31,794][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:32,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:32,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:33,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:33,424][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:33,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:34,080][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:34,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:35,121][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:35,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:35,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:35,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:36,692][__main__][INFO] - Iteration 176 took 20s (33.78% Gen, 62.04% Train). Generation: 7s, Training: 12s. Estimated remaining time: 16h 25m 49s. Estimated total time: 17h 27m 36s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 55s, 500 more iterations: 2h 54m 36s.
[2025-11-13 09:05:36,694][__main__][INFO] - Starting iteration 176.
[2025-11-13 09:05:36,697][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:36,697][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:44,599][__main__][INFO] - Number of regex retries in iteration 176: 0
[2025-11-13 09:05:44,600][__main__][INFO] - agents played in iteration 176 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:05:45,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:45,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:45,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:45,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:45,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:45,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:45,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:46,096][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:47,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:47,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:48,385][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:49,694][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:50,019][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:50,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:50,669][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:51,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:51,975][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:52,299][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:52,622][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:53,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:53,925][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:54,253][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:54,578][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:54,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:55,562][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:55,887][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:56,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:56,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:57,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:57,619][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:57,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:58,494][__main__][INFO] - Iteration 177 took 21s (36.25% Gen, 59.73% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 44s. Estimated total time: 18h 9m 53s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 19s, 500 more iterations: 3h 1m 38s.
[2025-11-13 09:05:58,496][__main__][INFO] - Starting iteration 177.
[2025-11-13 09:05:58,498][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:58,499][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:06,644][__main__][INFO] - Number of regex retries in iteration 177: 0
[2025-11-13 09:06:06,645][__main__][INFO] - agents played in iteration 177 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:06:07,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:07,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:07,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:07,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:07,168][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:07,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:07,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:08,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:08,454][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:10,094][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:10,743][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:11,072][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:11,396][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:12,046][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:12,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:12,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:13,018][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:13,343][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:13,992][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:14,316][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:14,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:14,965][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:15,291][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:15,615][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:16,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:16,589][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:16,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:17,241][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:17,893][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:18,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:18,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:19,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:19,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:19,629][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:20,515][__main__][INFO] - Iteration 178 took 22s (37.00% Gen, 58.97% Train). Generation: 8s, Training: 12s. Estimated remaining time: 17h 18m 20s. Estimated total time: 18h 20m 52s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 41s, 500 more iterations: 3h 3m 28s.
[2025-11-13 09:06:20,517][__main__][INFO] - Starting iteration 178.
[2025-11-13 09:06:20,519][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:20,520][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:28,692][__main__][INFO] - Number of regex retries in iteration 178: 0
[2025-11-13 09:06:28,692][__main__][INFO] - agents played in iteration 178 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:06:29,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:29,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:29,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:29,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:29,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:29,217][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:30,828][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:31,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:31,478][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:32,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:33,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:33,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:33,744][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:34,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:34,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:34,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:35,040][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:35,363][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:35,686][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:36,010][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:36,335][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:36,983][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:37,306][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:37,632][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:37,955][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:38,927][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:39,252][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:39,902][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:40,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:40,943][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:41,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:41,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:41,630][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:42,504][__main__][INFO] - Iteration 179 took 21s (37.17% Gen, 58.85% Train). Generation: 8s, Training: 12s. Estimated remaining time: 17h 16m 25s. Estimated total time: 18h 19m 18s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 38s, 500 more iterations: 3h 3m 13s.
[2025-11-13 09:06:42,506][__main__][INFO] - Starting iteration 179.
[2025-11-13 09:06:42,509][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:42,509][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:50,551][__main__][INFO] - Number of regex retries in iteration 179: 0
[2025-11-13 09:06:50,552][__main__][INFO] - agents played in iteration 179 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:06:50,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:51,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:51,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:51,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:51,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:51,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:52,076][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:52,400][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:52,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:53,049][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:53,373][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:54,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:55,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:55,645][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:55,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:56,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:56,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:57,587][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:57,912][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:58,236][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:58,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:58,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:59,207][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:59,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:59,854][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:00,178][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:00,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:00,827][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:01,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:01,800][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:02,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:02,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:03,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:03,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:03,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:04,471][__main__][INFO] - Iteration 180 took 21s (36.62% Gen, 59.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 14m 54s. Estimated total time: 18h 18m 9s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 36s, 500 more iterations: 3h 3m 1s.
[2025-11-13 09:07:04,473][__main__][INFO] - Starting iteration 180.
[2025-11-13 09:07:04,476][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:07:04,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:12,814][__main__][INFO] - Number of regex retries in iteration 180: 0
[2025-11-13 09:07:12,815][__main__][INFO] - agents played in iteration 180 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:07:13,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:13,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:14,696][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:15,020][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:15,344][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:15,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:16,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:16,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:17,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:17,630][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:18,279][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:18,608][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:18,936][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:19,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:19,917][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:20,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:20,901][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:21,226][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:21,550][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:21,876][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:22,523][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:22,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:23,171][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:23,494][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:23,818][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:24,143][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:24,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:25,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:25,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:25,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:25,928][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:27,737][__main__][INFO] - Iteration 181 took 23s (35.84% Gen, 56.37% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 26s. Estimated total time: 19h 23m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 50s.
[2025-11-13 09:07:27,739][__main__][INFO] - Starting iteration 181.
[2025-11-13 09:07:27,741][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:07:27,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:36,434][__main__][INFO] - Number of regex retries in iteration 181: 0
[2025-11-13 09:07:36,435][__main__][INFO] - agents played in iteration 181 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:07:36,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:36,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:36,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:36,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:36,968][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:36,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:37,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:38,605][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:38,928][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:39,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:39,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:40,227][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:40,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:40,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:41,853][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:42,506][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:43,159][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:43,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:43,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:44,135][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:44,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:45,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:45,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:46,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:46,733][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:47,063][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:47,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:47,718][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:48,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
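Each iteration walks 128 mini-batches (logging every fourth), accumulates the policy-gradient contribution, and only then applies a single reinforce step; the "3840 tokens" line is the running token count (128 × 30 tokens). A hedged sketch of that accumulate-then-step pattern, with all names (`reinforce_accumulate`, `grad_fn`, `apply_step`) hypothetical rather than the repository's actual functions:

```python
def reinforce_accumulate(minibatches, grad_fn, apply_step, log_every=4, log=print):
    """Accumulate policy-gradient contributions over all mini-batches,
    then apply one optimizer step; logs every `log_every`-th mini-batch."""
    accum = 0.0
    tokens = 0
    n = len(minibatches)
    for i, mb in enumerate(minibatches):
        if i % log_every == 0:
            log(f"Processing mini-batch {i} of {n}")
        g, t = grad_fn(mb)  # per-mini-batch gradient contribution and token count
        accum += g
        tokens += t
    log(f"Accumulated the policy gradient loss for {tokens} tokens.")
    avg_grad = accum / n
    apply_step(avg_grad)  # single update from the averaged gradient
    return avg_grad
```

In the real trainer the accumulation happens on the autograd graph (repeated `loss.backward()` calls) rather than on a scalar, but the control flow, and the log cadence, are the same.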
[2025-11-13 09:07:48,766][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:49,489][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:49,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:49,492][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:50,436][__main__][INFO] - Iteration 182 took 22s (38.30% Gen, 57.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 50m 44s. Estimated total time: 18h 54m 45s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 7s.
[2025-11-13 09:07:50,437][__main__][INFO] - Starting iteration 182.
[2025-11-13 09:07:50,440][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
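The per-iteration summary lines extrapolate remaining and total wall-clock time from the average iteration duration (at roughly 22-23 s/iteration, 100 more iterations comes out near the 37-38 minutes the log reports). A small sketch of that arithmetic; the function name and return keys are hypothetical:

```python
def eta_estimates(iter_seconds, done, total):
    """Extrapolate remaining/total wall-clock time (in seconds) from the
    durations of the iterations completed so far."""
    avg = sum(iter_seconds) / len(iter_seconds)  # mean seconds per iteration
    return {
        "remaining_s": avg * (total - done),
        "total_s": avg * total,
        "next_10_s": avg * 10,
        "next_100_s": avg * 100,
        "next_500_s": avg * 500,
    }
```

A production version would usually use a recent window or an exponential moving average so early, atypical iterations do not dominate the estimate.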
[2025-11-13 09:07:50,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:59,319][__main__][INFO] - Number of regex retries in iteration 182: 0
[2025-11-13 09:07:59,320][__main__][INFO] - agents played in iteration 182 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:07:59,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:59,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:00,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:01,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:01,564][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:01,890][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:02,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:02,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:02,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:03,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:03,522][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:04,173][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:04,499][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:04,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:05,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:05,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:05,803][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:06,127][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:06,778][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:07,105][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:07,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:08,405][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:08,730][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:09,057][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:09,381][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:09,707][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:10,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:10,359][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:10,683][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:11,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:11,749][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:12,485][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:12,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:12,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:13,459][__main__][INFO] - Iteration 183 took 23s (38.57% Gen, 57.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 6m 34s. Estimated total time: 19h 10m 58s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 49s.
[2025-11-13 09:08:13,461][__main__][INFO] - Starting iteration 183.
[2025-11-13 09:08:13,464][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:13,464][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:22,303][__main__][INFO] - Number of regex retries in iteration 183: 0
[2025-11-13 09:08:22,304][__main__][INFO] - agents played in iteration 183 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:08:22,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,850][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:22,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:24,195][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:24,519][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:24,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:25,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:25,495][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:25,821][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:26,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:26,796][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:27,444][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:27,769][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:28,093][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:28,741][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:29,065][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:29,389][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:29,714][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:30,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:30,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:31,338][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:31,664][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:32,313][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:32,638][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:32,963][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:33,290][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:33,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:34,671][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:35,390][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:35,392][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:35,393][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:36,311][__main__][INFO] - Iteration 184 took 22s (38.69% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 57m 37s. Estimated total time: 19h 2m 24s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 4s, 500 more iterations: 3h 10m 24s.
[2025-11-13 09:08:36,313][__main__][INFO] - Starting iteration 184.
[2025-11-13 09:08:36,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:36,316][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:45,227][__main__][INFO] - Number of regex retries in iteration 184: 0
[2025-11-13 09:08:45,227][__main__][INFO] - agents played in iteration 184 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:08:45,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:45,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:46,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:47,108][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:47,755][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:48,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:48,403][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:48,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:49,052][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:49,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:49,701][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:50,024][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:50,348][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:50,672][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:50,997][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:51,322][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:51,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:52,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:52,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:53,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:53,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:54,243][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:54,891][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:55,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:56,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:56,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:57,570][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:58,301][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:58,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:58,303][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:59,223][__main__][INFO] - Iteration 185 took 22s (38.90% Gen, 57.08% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 16s. Estimated total time: 19h 5m 26s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 54s.
[2025-11-13 09:08:59,225][__main__][INFO] - Starting iteration 185.
[2025-11-13 09:08:59,228][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:59,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:08,154][__main__][INFO] - Number of regex retries in iteration 185: 0
[2025-11-13 09:09:08,155][__main__][INFO] - agents played in iteration 185 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:09:08,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:08,704][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:09,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:09,732][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:10,057][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:10,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:10,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:11,356][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:11,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:12,337][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:12,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:12,987][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:13,637][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:13,962][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:14,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:14,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:15,254][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:15,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:15,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:16,227][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:16,551][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:16,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:17,199][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:17,847][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:18,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:18,821][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:19,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:19,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:20,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:09:21,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:09:21,258][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:09:21,260][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:09:22,210][__main__][INFO] - Iteration 186 took 22s (38.84% Gen, 57.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 36s. Estimated total time: 19h 9m 9s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 31s.
[2025-11-13 09:09:22,212][__main__][INFO] - Starting iteration 186.
[2025-11-13 09:09:22,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:09:22,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:30,766][__main__][INFO] - Number of regex retries in iteration 186: 0
[2025-11-13 09:09:30,767][__main__][INFO] - agents played in iteration 186 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:09:31,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:31,313][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:31,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:32,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:33,304][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:33,634][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:33,958][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:34,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:34,933][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:35,910][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:36,885][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:37,536][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:37,861][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:38,510][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:38,835][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:39,158][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:39,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:40,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:40,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:40,787][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:41,437][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:41,763][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:42,087][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:42,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:43,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:09:43,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:09:43,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:09:43,869][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:09:44,812][__main__][INFO] - Iteration 187 took 22s (37.84% Gen, 57.98% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 0s. Estimated total time: 18h 49m 56s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 39s, 500 more iterations: 3h 8m 19s. [2025-11-13 09:09:44,814][__main__][INFO] - Starting iteration 187. [2025-11-13 09:09:44,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:09:44,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:09:53,664][__main__][INFO] - Number of regex retries in iteration 187: 0 [2025-11-13 09:09:53,665][__main__][INFO] - agents played in iteration 187 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:09:54,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:54,219][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:09:54,220][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:09:54,950][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:09:55,247][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:09:55,576][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:09:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:09:56,228][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:09:56,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:09:56,882][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:09:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:09:57,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:09:57,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:09:58,180][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:09:58,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:09:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:09:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:09:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:09:59,801][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:00,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:01,103][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:01,427][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:10:01,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:02,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:02,402][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:02,729][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:03,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:03,382][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:03,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:04,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:04,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:05,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:10:06,051][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:06,786][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:06,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:06,789][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:07,799][__main__][INFO] - Iteration 188 took 22s (38.49% Gen, 57.11% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 2m 50s. Estimated total time: 19h 9m 9s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 31s. [2025-11-13 09:10:07,801][__main__][INFO] - Starting iteration 188. [2025-11-13 09:10:07,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:10:07,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:10:16,625][__main__][INFO] - Number of regex retries in iteration 188: 0 [2025-11-13 09:10:16,626][__main__][INFO] - agents played in iteration 188 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:10:17,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:10:17,180][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:10:17,897][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:10:18,191][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:10:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:10:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:10:19,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:10:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:10:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:10:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:10:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:10:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:10:21,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:10:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:10:21,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:10:22,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:10:22,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:10:22,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:23,093][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:23,419][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:23,743][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:24,073][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:24,403][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:10:24,732][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:25,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:25,386][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:25,716][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:27,014][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:27,339][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:27,664][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:27,991][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:28,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:10:29,032][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:29,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:29,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:29,767][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:30,702][__main__][INFO] - Iteration 189 took 22s (38.53% Gen, 57.39% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 58m 15s. Estimated total time: 19h 4m 56s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 49s. [2025-11-13 09:10:30,703][__main__][INFO] - Starting iteration 189. [2025-11-13 09:10:30,706][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:10:30,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:10:39,604][__main__][INFO] - Number of regex retries in iteration 189: 0 [2025-11-13 09:10:39,605][__main__][INFO] - agents played in iteration 189 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:10:40,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,164][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:10:40,165][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:10:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:10:41,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:10:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:10:41,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:10:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:10:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:10:42,826][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:10:43,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:10:43,481][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:10:43,808][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:10:44,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:10:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:10:44,781][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:10:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:10:45,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:10:45,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:46,079][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:46,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:47,053][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:47,382][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:10:47,707][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:48,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:48,681][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:49,007][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:49,333][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:49,660][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:49,984][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:50,311][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:50,969][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:51,294][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:10:52,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:52,788][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:52,790][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:52,792][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:53,769][__main__][INFO] - Iteration 190 took 23s (38.58% Gen, 57.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 6m 8s. Estimated total time: 19h 13m 12s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 12s. [2025-11-13 09:10:53,771][__main__][INFO] - Starting iteration 190. [2025-11-13 09:10:53,774][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:10:53,775][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:02,770][__main__][INFO] - Number of regex retries in iteration 190: 0 [2025-11-13 09:11:02,771][__main__][INFO] - agents played in iteration 190 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:11:03,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,310][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:03,310][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:11:04,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:04,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:04,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:05,650][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:06,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:06,631][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:06,960][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:07,934][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:08,582][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:08,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:09,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:09,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:10,204][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:10,529][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:11:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:11:11,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:11:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:11:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:11:12,152][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:11:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:11:12,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:11:13,125][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:11:13,450][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:11:13,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:11:14,099][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:11:14,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:11:15,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:11:15,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:11:15,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:11:15,928][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:11:17,707][__main__][INFO] - Iteration 191 took 23s (37.58% Gen, 54.98% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 49m 10s. Estimated total time: 19h 56m 39s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 53s, 500 more iterations: 3h 19m 26s. [2025-11-13 09:11:17,708][__main__][INFO] - Starting iteration 191. [2025-11-13 09:11:17,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:11:17,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:26,514][__main__][INFO] - Number of regex retries in iteration 191: 0 [2025-11-13 09:11:26,515][__main__][INFO] - agents played in iteration 191 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:11:26,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:27,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:27,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:27,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:27,070][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:27,071][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:11:27,743][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:28,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:28,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:28,685][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:29,009][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:29,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:29,982][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:30,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:31,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:31,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:31,929][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:32,576][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:33,225][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:33,551][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:33,876][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:34,202][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128
[2025-11-13 09:11:34,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:34,855][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:35,502][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:35,825][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:36,150][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:11:36,475][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:11:36,800][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:11:37,123][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:11:37,449][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:11:37,774][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:11:38,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:11:38,819][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:11:39,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:11:39,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:11:39,549][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:11:40,602][__main__][INFO] - Iteration 192 took 22s (38.45% Gen, 56.94% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 56m 44s. Estimated total time: 19h 4m 35s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 45s.
[2025-11-13 09:11:40,604][__main__][INFO] - Starting iteration 192.
[2025-11-13 09:11:40,608][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:11:40,608][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:11:48,967][__main__][INFO] - Number of regex retries in iteration 192: 0
[2025-11-13 09:11:48,968][__main__][INFO] - agents played in iteration 192 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:11:49,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:49,503][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:11:49,503][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:11:50,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:11:50,530][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:11:50,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:11:51,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:11:51,518][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:11:51,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:11:52,173][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:11:52,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:11:52,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:11:53,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:11:53,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:11:53,796][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:11:54,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:11:54,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:11:54,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:11:55,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:11:55,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:11:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:11:56,063][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:11:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:11:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:11:57,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:57,363][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:57,687][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:58,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:58,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:11:58,980][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:11:59,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:11:59,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:11:59,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:00,279][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:00,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:01,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:02,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:02,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:02,018][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:03,030][__main__][INFO] - Iteration 193 took 22s (37.28% Gen, 58.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 32m 55s. Estimated total time: 18h 41m 9s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 22s, 500 more iterations: 3h 6m 51s.
[2025-11-13 09:12:03,032][__main__][INFO] - Starting iteration 193.
[2025-11-13 09:12:03,035][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:03,036][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:12,291][__main__][INFO] - Number of regex retries in iteration 193: 0
[2025-11-13 09:12:12,291][__main__][INFO] - agents played in iteration 193 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:12:12,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:12,847][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:12,847][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:12:13,584][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:12:13,880][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:12:14,206][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:12:14,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:12:14,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:12:15,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:12:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:12:15,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:12:16,158][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:12:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:12:16,808][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:12:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:12:17,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:12:17,782][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:12:18,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:12:18,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:12:18,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:12:19,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:12:19,410][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:12:19,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:12:20,059][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:12:20,384][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:12:20,707][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:12:21,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:12:21,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:12:21,688][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:12:22,012][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:12:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:12:22,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:22,985][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:23,310][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:23,635][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:23,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:24,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:25,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:25,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:25,403][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:26,373][__main__][INFO] - Iteration 194 took 23s (39.66% Gen, 56.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 18s. Estimated total time: 19h 26m 55s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 29s.
[2025-11-13 09:12:26,375][__main__][INFO] - Starting iteration 194.
[2025-11-13 09:12:26,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:26,379][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:35,228][__main__][INFO] - Number of regex retries in iteration 194: 0
[2025-11-13 09:12:35,228][__main__][INFO] - agents played in iteration 194 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:12:35,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:35,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:35,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:12:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:12:36,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:12:37,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:12:37,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:12:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:12:38,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:12:38,444][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:12:38,768][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:12:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:12:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:12:39,740][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:12:40,064][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:12:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:12:40,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:12:41,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:12:41,365][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:12:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:12:42,015][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:12:42,340][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:12:42,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:12:42,992][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:12:43,317][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:12:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:12:43,968][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:12:44,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:12:44,619][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:12:44,943][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:12:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:12:45,591][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:46,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:46,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:47,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:48,330][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:48,332][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:48,333][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:49,575][__main__][INFO] - Iteration 195 took 23s (38.15% Gen, 56.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 10m 51s. Estimated total time: 19h 19m 51s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 18s.
[2025-11-13 09:12:49,577][__main__][INFO] - Starting iteration 195.
[2025-11-13 09:12:49,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:49,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:58,753][__main__][INFO] - Number of regex retries in iteration 195: 0
[2025-11-13 09:12:58,754][__main__][INFO] - agents played in iteration 195 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:12:59,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,317][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:59,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:00,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:00,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:00,668][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:00,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:01,641][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:01,964][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:02,612][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:03,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:03,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:04,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:04,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:05,204][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:05,528][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:06,176][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:06,500][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:07,147][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:07,794][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:08,120][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:08,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:09,416][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:09,739][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:10,391][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:11,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:11,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:11,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:11,862][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:12,812][__main__][INFO] - Iteration 196 took 23s (39.49% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 14s. Estimated total time: 19h 21m 37s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 36s.
[2025-11-13 09:13:12,814][__main__][INFO] - Starting iteration 196.
[2025-11-13 09:13:12,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:12,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:13:21,985][__main__][INFO] - Number of regex retries in iteration 196: 0
[2025-11-13 09:13:21,986][__main__][INFO] - agents played in iteration 196 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:13:22,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,530][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:13:22,531][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:23,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:23,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:24,192][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:24,517][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:24,843][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:25,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:25,492][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:25,816][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:26,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:26,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:26,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:27,115][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:27,441][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:27,766][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:28,091][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:28,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:28,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:29,397][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:29,723][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:30,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:30,371][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:30,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:31,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:31,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:32,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:32,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:32,965][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:33,290][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:33,615][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:34,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:35,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:35,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:35,062][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:36,011][__main__][INFO] - Iteration 197 took 23s (39.53% Gen, 56.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 57s. Estimated total time: 19h 19m 43s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 17s.
[2025-11-13 09:13:36,013][__main__][INFO] - Starting iteration 197.
[2025-11-13 09:13:36,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:36,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:13:45,105][__main__][INFO] - Number of regex retries in iteration 197: 0
[2025-11-13 09:13:45,105][__main__][INFO] - agents played in iteration 197 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:13:45,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,654][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:13:45,655][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:46,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:46,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:46,991][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:47,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:47,652][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:48,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:48,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:49,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:49,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:49,947][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:50,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:50,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:50,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:51,585][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:51,914][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:52,892][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:54,194][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:54,848][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:55,500][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:55,827][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:56,153][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:56,478][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:56,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:57,517][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:58,263][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:58,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:58,266][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:59,207][__main__][INFO] - Iteration 198 took 23s (39.19% Gen, 56.75% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 28s. Estimated total time: 19h 19m 38s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 16s.
[2025-11-13 09:13:59,210][__main__][INFO] - Starting iteration 198.
[2025-11-13 09:13:59,213][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:59,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:07,950][__main__][INFO] - Number of regex retries in iteration 198: 0
[2025-11-13 09:14:07,951][__main__][INFO] - agents played in iteration 198 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:14:08,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,498][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:08,499][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:09,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:09,530][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:09,855][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:10,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:10,504][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:11,159][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:11,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:12,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:12,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:12,788][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:13,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:13,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:13,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:14,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:14,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:14,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:15,077][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:15,405][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:16,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:16,379][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:16,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:17,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:17,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:17,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:18,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:18,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:19,311][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:19,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:20,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:21,097][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:21,098][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:21,101][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:22,090][__main__][INFO] - Iteration 199 took 22s (38.19% Gen, 57.48% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 53m 21s. Estimated total time: 19h 3m 54s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 39s.
[2025-11-13 09:14:22,092][__main__][INFO] - Starting iteration 199.
[2025-11-13 09:14:22,095][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:22,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:31,251][__main__][INFO] - Number of regex retries in iteration 199: 0
[2025-11-13 09:14:31,252][__main__][INFO] - agents played in iteration 199 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:14:31,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:31,806][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:32,834][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:33,484][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:33,808][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:34,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:34,457][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:34,782][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:35,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:35,432][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:35,758][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:36,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:36,406][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:36,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:37,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:37,708][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:38,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:38,361][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:38,685][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:39,009][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:39,659][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:39,984][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:40,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:40,956][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:41,935][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:42,262][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:42,587][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:42,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:43,646][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:44,393][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:44,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:44,396][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:45,358][__main__][INFO] - Iteration 200 took 23s (39.35% Gen, 56.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 14s. Estimated total time: 19h 23m 10s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 51s.
[2025-11-13 09:14:45,360][__main__][INFO] - Starting iteration 200.
[2025-11-13 09:14:45,363][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:45,364][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:54,504][__main__][INFO] - Number of regex retries in iteration 200: 0
[2025-11-13 09:14:54,504][__main__][INFO] - agents played in iteration 200 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:14:54,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:54,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:55,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:55,781][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:56,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:57,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:57,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:57,712][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:58,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:59,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:59,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:00,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:00,639][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:01,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:01,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:01,945][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:02,268][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:02,594][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:04,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:04,562][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:04,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:05,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:05,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:06,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:06,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:07,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:07,662][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:07,663][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:09,521][__main__][INFO] - Iteration 201 took 24s (37.83% Gen, 54.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 56m 37s. Estimated total time: 20h 7m 57s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 19s.
[2025-11-13 09:15:09,523][__main__][INFO] - Starting iteration 201.
[2025-11-13 09:15:09,526][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:09,527][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:18,556][__main__][INFO] - Number of regex retries in iteration 201: 0
[2025-11-13 09:15:18,557][__main__][INFO] - agents played in iteration 201 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:15:18,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:19,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:19,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:19,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:19,094][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:19,095][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:20,063][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:20,388][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:20,712][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:21,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:21,691][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:22,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:22,993][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:23,318][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:23,642][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:23,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:24,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:24,615][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:24,945][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:25,272][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:25,602][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:25,929][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:26,258][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:26,583][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:26,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:27,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:27,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:28,885][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:30,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:30,929][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:31,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:31,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:31,662][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:32,658][__main__][INFO] - Iteration 202 took 23s (39.03% Gen, 56.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 55s. Estimated total time: 19h 16m 38s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 46s.
[2025-11-13 09:15:32,661][__main__][INFO] - Starting iteration 202.
[2025-11-13 09:15:32,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:32,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:41,888][__main__][INFO] - Number of regex retries in iteration 202: 0
[2025-11-13 09:15:41,888][__main__][INFO] - agents played in iteration 202 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:15:42,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:42,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:43,173][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:43,469][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:44,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:44,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:45,095][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:45,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:45,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:46,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:46,394][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:47,045][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:47,371][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:47,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:48,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:48,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:48,667][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:48,990][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:49,315][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:49,638][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:50,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:50,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:51,263][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:51,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:51,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:52,236][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:52,884][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:53,208][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:53,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:54,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:54,980][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:54,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:54,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:55,974][__main__][INFO] - Iteration 203 took 23s (39.57% Gen, 56.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 28s. Estimated total time: 19h 25m 35s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 15s.
[2025-11-13 09:15:55,976][__main__][INFO] - Starting iteration 203.
[2025-11-13 09:15:55,980][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:55,980][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:16:05,512][__main__][INFO] - Number of regex retries in iteration 203: 0 [2025-11-13 09:16:05,512][__main__][INFO] - agents played in iteration 203 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:16:05,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:05,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:06,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:06,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:06,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:16:06,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:16:06,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:07,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:07,412][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:07,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:08,063][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:08,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:08,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:09,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:10,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:10,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:11,316][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:11,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:11,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:12,288][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:12,938][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:16:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:16:13,587][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:16:13,911][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:16:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:16:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:16:14,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:16:15,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:16:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:16:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:16:16,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:16:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:16:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:16:17,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:16:17,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:16:18,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:16:18,616][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:16:18,617][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:16:19,556][__main__][INFO] - Iteration 204 took 23s (40.43% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 21s. Estimated total time: 19h 38m 51s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 28s.
[2025-11-13 09:16:19,558][__main__][INFO] - Starting iteration 204.
[2025-11-13 09:16:19,561][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:16:19,561][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:29,082][__main__][INFO] - Number of regex retries in iteration 204: 0
[2025-11-13 09:16:29,083][__main__][INFO] - agents played in iteration 204 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:16:29,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:29,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:30,344][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:30,639][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:30,968][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:31,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:31,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:31,941][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:32,265][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:32,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:32,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:33,244][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:33,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:34,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:34,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:35,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:35,514][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:35,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:36,492][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:16:36,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:16:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:16:37,471][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:16:37,797][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:16:38,121][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:16:38,446][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:16:38,775][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:16:39,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:16:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:16:39,758][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:16:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:16:40,411][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:16:40,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:16:41,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:16:42,202][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:16:42,204][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:16:42,206][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:16:43,157][__main__][INFO] - Iteration 205 took 23s (40.35% Gen, 55.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 56s. Estimated total time: 19h 39m 50s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 38s.
[2025-11-13 09:16:43,159][__main__][INFO] - Starting iteration 205.
[2025-11-13 09:16:43,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:16:43,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:52,523][__main__][INFO] - Number of regex retries in iteration 205: 0
[2025-11-13 09:16:52,523][__main__][INFO] - agents played in iteration 205 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:16:52,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:53,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:53,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:53,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:53,076][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:53,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:54,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:54,743][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:55,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:55,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:55,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:56,046][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:56,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:56,697][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:57,022][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:58,324][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:58,971][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:59,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:59,627][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:00,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:00,927][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:01,251][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:01,579][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:01,902][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:02,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:02,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:03,206][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:03,857][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:04,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:04,900][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:05,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:05,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:05,647][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:06,631][__main__][INFO] - Iteration 206 took 23s (39.89% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 12s. Estimated total time: 19h 33m 30s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 35s.
[2025-11-13 09:17:06,633][__main__][INFO] - Starting iteration 206.
[2025-11-13 09:17:06,636][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:06,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:17:16,140][__main__][INFO] - Number of regex retries in iteration 206: 0
[2025-11-13 09:17:16,140][__main__][INFO] - agents played in iteration 206 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:17:16,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,686][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:17:16,687][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:17:17,423][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:17:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:17:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:17:18,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:17:18,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:17:19,027][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:17:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:17:19,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:17:20,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:17:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:17:20,656][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:17:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:17:21,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:17:21,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:17:21,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:17:22,291][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:17:22,615][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:17:22,940][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:17:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:17:23,591][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:24,246][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:24,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:24,896][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:25,221][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:26,200][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:26,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:26,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:27,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:27,505][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:27,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:28,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:29,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:29,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:29,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:30,289][__main__][INFO] - Iteration 207 took 23s (40.17% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 0s. Estimated total time: 19h 42m 42s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 7s.
[2025-11-13 09:17:30,291][__main__][INFO] - Starting iteration 207.
[2025-11-13 09:17:30,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:30,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:17:39,639][__main__][INFO] - Number of regex retries in iteration 207: 0
[2025-11-13 09:17:39,640][__main__][INFO] - agents played in iteration 207 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:17:40,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,188][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:17:40,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:17:40,918][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:17:41,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:17:41,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:17:41,870][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:17:42,195][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:17:42,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:17:42,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:17:43,173][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:17:43,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:17:43,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:17:44,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:17:44,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:17:44,803][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:17:45,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:17:45,452][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:17:45,777][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:17:46,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:17:46,426][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:17:46,752][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:17:47,079][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:47,407][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:49,026][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:49,350][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:49,674][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:49,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:50,323][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:50,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:51,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:52,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:52,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:52,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:52,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:53,723][__main__][INFO] - Iteration 208 took 23s (39.88% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 24s. Estimated total time: 19h 31m 29s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 14s.
[2025-11-13 09:17:53,726][__main__][INFO] - Starting iteration 208.
[2025-11-13 09:17:53,729][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:53,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:02,948][__main__][INFO] - Number of regex retries in iteration 208: 0
[2025-11-13 09:18:02,949][__main__][INFO] - agents played in iteration 208 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:18:03,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:03,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:04,219][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:04,515][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:04,842][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:05,172][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:05,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:05,825][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:06,153][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:06,804][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:07,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:07,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:07,779][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:08,432][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:08,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:09,085][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:09,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:10,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:11,044][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:11,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:12,022][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:13,005][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:13,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:13,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:18:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:18:14,306][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:18:14,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:18:15,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:18:16,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:18:16,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:18:16,080][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:18:17,056][__main__][INFO] - Iteration 209 took 23s (39.52% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 11m 55s. Estimated total time: 19h 26m 23s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 23s.
[2025-11-13 09:18:17,058][__main__][INFO] - Starting iteration 209.
[2025-11-13 09:18:17,062][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:18:17,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:26,148][__main__][INFO] - Number of regex retries in iteration 209: 0
[2025-11-13 09:18:26,149][__main__][INFO] - agents played in iteration 209 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:18:26,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:26,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:26,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:26,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:26,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:26,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:27,428][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:27,725][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:28,053][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:28,379][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:28,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:29,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:29,355][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:30,003][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:30,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:30,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:30,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:31,301][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:31,950][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:32,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:32,598][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:33,572][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:33,899][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:34,223][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:34,548][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:34,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:35,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:35,525][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:35,850][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:36,175][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:36,499][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:18:37,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:18:37,472][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:18:37,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:18:38,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:18:39,224][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:18:39,226][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:18:39,227][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:18:40,213][__main__][INFO] - Iteration 210 took 23s (39.25% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 45s. Estimated total time: 19h 17m 36s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 56s.
[2025-11-13 09:18:40,216][__main__][INFO] - Starting iteration 210.
[2025-11-13 09:18:40,220][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:18:40,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:49,661][__main__][INFO] - Number of regex retries in iteration 210: 0
[2025-11-13 09:18:49,662][__main__][INFO] - agents played in iteration 210 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:18:50,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:50,209][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:50,951][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:51,248][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:51,573][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:52,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:53,207][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:53,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:53,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:54,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:55,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:55,477][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:55,800][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:56,125][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:56,449][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:57,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:57,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:57,749][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:58,400][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:58,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:59,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:59,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:00,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:00,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:00,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:01,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:02,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:02,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:02,813][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:02,815][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:04,663][__main__][INFO] - Iteration 211 took 24s (38.62% Gen, 53.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 6m 58s. Estimated total time: 20h 22m 13s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 44s, 500 more iterations: 3h 23m 42s.
[2025-11-13 09:19:04,666][__main__][INFO] - Starting iteration 211.
[2025-11-13 09:19:04,668][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:04,669][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:14,396][__main__][INFO] - Number of regex retries in iteration 211: 0
[2025-11-13 09:19:14,397][__main__][INFO] - agents played in iteration 211 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:19:14,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,941][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:14,942][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:15,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:16,239][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:16,893][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:17,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:17,543][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:17,868][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:18,192][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:18,516][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:19,169][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:19,818][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:20,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:20,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:21,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:21,442][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:22,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:23,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:23,384][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:23,708][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:24,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:24,681][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:25,008][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:25,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:25,665][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:25,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:26,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:27,511][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:27,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:27,514][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:28,469][__main__][INFO] - Iteration 212 took 23s (40.87% Gen, 55.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 34m 24s. Estimated total time: 19h 50m 3s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 20s.
[2025-11-13 09:19:28,471][__main__][INFO] - Starting iteration 212.
[2025-11-13 09:19:28,474][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:28,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:37,974][__main__][INFO] - Number of regex retries in iteration 212: 0
[2025-11-13 09:19:37,974][__main__][INFO] - agents played in iteration 212 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:19:38,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:38,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:39,250][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:39,547][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:39,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:40,523][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:40,847][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:41,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:41,818][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:42,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:42,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:42,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:43,441][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:43,765][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:44,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:45,384][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:46,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:46,691][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:47,344][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:47,992][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:49,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:50,337][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:51,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:51,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:51,060][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:52,010][__main__][INFO] - Iteration 213 took 23s (40.36% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 49s. Estimated total time: 19h 36m 52s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 13s, 500 more iterations: 3h 16m 8s.
[2025-11-13 09:19:52,013][__main__][INFO] - Starting iteration 213.
[2025-11-13 09:19:52,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:52,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:01,556][__main__][INFO] - Number of regex retries in iteration 213: 0
[2025-11-13 09:20:01,557][__main__][INFO] - agents played in iteration 213 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:20:02,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:02,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:02,832][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:03,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:03,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:04,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:05,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:05,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:06,057][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:06,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:07,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:07,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:07,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:08,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:08,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:08,658][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:09,308][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:09,958][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:10,281][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:11,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:11,578][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:11,902][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:12,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:12,875][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:13,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:13,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:20:14,630][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:20:14,631][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:20:14,633][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:20:15,675][__main__][INFO] - Iteration 214 took 23s (40.32% Gen, 55.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 36s. Estimated total time: 19h 43m 2s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 10s.
[2025-11-13 09:20:15,677][__main__][INFO] - Starting iteration 214.
[2025-11-13 09:20:15,681][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:20:15,681][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:20:25,029][__main__][INFO] - Number of regex retries in iteration 214: 0 [2025-11-13 09:20:25,030][__main__][INFO] - agents played in iteration 214 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:20:25,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:25,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:25,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:25,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:25,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:20:25,567][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:20:26,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:26,588][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:27,893][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:28,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:28,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:29,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:29,518][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:29,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:30,818][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:31,141][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:31,465][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:32,439][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:32,764][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:33,087][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:33,412][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:33,737][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:34,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:34,385][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:35,033][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:35,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:36,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:37,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:20:38,094][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:20:38,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:20:38,097][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:20:39,092][__main__][INFO] - Iteration 215 took 23s (39.93% Gen, 55.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 47s. Estimated total time: 19h 30m 37s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 6s.
[2025-11-13 09:20:39,094][__main__][INFO] - Starting iteration 215.
[2025-11-13 09:20:39,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:20:39,097][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:48,700][__main__][INFO] - Number of regex retries in iteration 215: 0
[2025-11-13 09:20:48,701][__main__][INFO] - agents played in iteration 215 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:20:49,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,245][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:49,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:49,986][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:50,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:50,935][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:51,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:51,909][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:52,235][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:52,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:53,209][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:54,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:54,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:54,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:55,157][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:55,808][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:56,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:56,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:57,106][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:57,430][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:57,756][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:58,079][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:58,403][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:58,728][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:59,053][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:59,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:59,701][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:00,026][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:00,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:01,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:01,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:01,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:01,840][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:02,834][__main__][INFO] - Iteration 216 took 23s (40.46% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 40s. Estimated total time: 19h 46m 54s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 49s.
[2025-11-13 09:21:02,836][__main__][INFO] - Starting iteration 216.
[2025-11-13 09:21:02,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:02,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:12,502][__main__][INFO] - Number of regex retries in iteration 216: 0
[2025-11-13 09:21:12,503][__main__][INFO] - agents played in iteration 216 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:21:12,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:13,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:13,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:13,048][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:13,048][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:13,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:14,061][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:14,711][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:15,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:15,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:15,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:16,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:16,330][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:16,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:16,982][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:17,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:17,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:17,968][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:18,293][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:18,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:18,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:19,264][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:19,590][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:19,914][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:20,238][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:20,562][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:20,886][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:21,211][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:21,863][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:22,189][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:22,838][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:23,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:23,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:23,816][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:24,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:24,873][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:25,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:25,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:25,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:26,530][__main__][INFO] - Iteration 217 took 23s (40.78% Gen, 55.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 58s. Estimated total time: 19h 44m 35s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 25s.
[2025-11-13 09:21:26,533][__main__][INFO] - Starting iteration 217.
[2025-11-13 09:21:26,535][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:26,536][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:36,273][__main__][INFO] - Number of regex retries in iteration 217: 0
[2025-11-13 09:21:36,274][__main__][INFO] - agents played in iteration 217 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:21:36,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:36,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:38,187][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:38,517][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:39,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:39,818][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:40,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:40,466][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:40,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:41,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:41,780][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:42,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:42,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:42,766][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:43,093][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:43,416][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:43,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:44,725][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:45,055][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:45,379][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:46,030][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:46,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:46,681][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:47,006][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:47,333][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:47,661][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:47,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:48,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:49,446][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:49,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:49,450][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:50,341][__main__][INFO] - Iteration 218 took 23s (40.91% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 32m 17s. Estimated total time: 19h 50m 18s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 23s.
[2025-11-13 09:21:50,343][__main__][INFO] - Starting iteration 218.
[2025-11-13 09:21:50,345][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:50,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:59,924][__main__][INFO] - Number of regex retries in iteration 218: 0
[2025-11-13 09:21:59,925][__main__][INFO] - agents played in iteration 218 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:22:00,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:00,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:00,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:01,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:01,486][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:01,814][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:02,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:03,436][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:04,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:04,411][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:04,734][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:05,058][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:05,383][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:05,707][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:06,031][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:06,355][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:07,002][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:07,650][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:07,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:08,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:09,271][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:09,595][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:09,918][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:10,892][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:11,542][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:12,276][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:13,011][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:13,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:13,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:13,933][__main__][INFO] - Iteration 219 took 23s (40.61% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 21m 1s. Estimated total time: 19h 39m 26s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 34s.
[2025-11-13 09:22:13,936][__main__][INFO] - Starting iteration 219.
[2025-11-13 09:22:13,938][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:13,939][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:22:23,467][__main__][INFO] - Number of regex retries in iteration 219: 0
[2025-11-13 09:22:23,468][__main__][INFO] - agents played in iteration 219 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:22:23,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:24,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:24,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:24,006][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:24,729][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:22:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:22:25,350][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:22:25,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:22:26,002][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:22:26,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:22:26,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:22:26,977][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:22:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:22:27,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:22:27,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:22:28,273][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:22:28,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:22:28,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:22:29,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:22:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:22:29,893][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:22:30,217][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:22:30,540][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:22:30,864][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:22:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:22:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:22:31,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:22:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:22:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:22:32,810][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:22:33,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:22:33,461][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:22:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:22:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:22:34,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:22:34,761][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:22:35,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
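The records above trace one optimizer update assembled by gradient accumulation: 128 mini-batches are processed (progress logged every 4th), the policy gradient loss is accumulated over 3840 tokens in total (i.e. 30 tokens per mini-batch), and only then is a single reinforce step applied. A minimal sketch of that pattern in plain Python; all names are illustrative assumptions, not the actual `mllm.training.trainer_common` implementation:

```python
# Sketch of the accumulation loop suggested by the log. Hypothetical names;
# a real trainer would accumulate gradients, not scalar losses.

NUM_MINI_BATCHES = 128
TOKENS_PER_MINI_BATCH = 3840 // NUM_MINI_BATCHES  # 30 tokens each

def accumulate_policy_gradient(minibatch_losses):
    """Accumulate per-mini-batch losses and token counts before one update."""
    total_loss = 0.0
    total_tokens = 0
    for i, loss in enumerate(minibatch_losses):
        if i % 4 == 0:
            print(f"Processing mini-batch {i} of {NUM_MINI_BATCHES}")
        total_loss += loss
        total_tokens += TOKENS_PER_MINI_BATCH
    # Normalizing by token count keeps the effective step size independent
    # of how the batch was split into mini-batches.
    return total_loss / total_tokens, total_tokens

mean_loss, n_tokens = accumulate_policy_gradient([1.0] * NUM_MINI_BATCHES)
```

With 128 mini-batches of 30 tokens this reproduces the "3840 tokens" figure in the log; the single update afterwards corresponds to the "Apply reinforce step" record.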
[2025-11-13 09:22:35,822][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:22:36,533][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:22:36,535][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:22:36,537][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:22:37,441][__main__][INFO] - Iteration 220 took 23s (40.54% Gen, 55.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 16m 21s. Estimated total time: 19h 35m 9s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 51s. [2025-11-13 09:22:37,443][__main__][INFO] - Starting iteration 220. [2025-11-13 09:22:37,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
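The per-iteration summary ("Iteration 220 took 23s ... Estimated remaining time ...") reads as a straightforward extrapolation from the mean iteration time. A hedged sketch of that arithmetic; the function names and exact formatting are assumptions, not the actual `__main__` code:

```python
def format_hms(seconds: int) -> str:
    """Render seconds as 'Xh Ym Zs', dropping leading zero fields (log style)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if h or m:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

def estimate_remaining(mean_iter_seconds: float, iterations_left: int) -> str:
    # Naive extrapolation: remaining time = mean iteration time * iterations left.
    return format_hms(round(mean_iter_seconds * iterations_left))
```

For example, at roughly 23.5 s per iteration, 10 more iterations extrapolates to about 3m 55s, matching the log's "Time estimates for 10 more iterations: 3m 55s".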
[2025-11-13 09:22:37,446][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:22:46,756][__main__][INFO] - Number of regex retries in iteration 220: 0 [2025-11-13 09:22:46,756][__main__][INFO] - agents played in iteration 220 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:22:47,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:47,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:47,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:47,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:47,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:22:47,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:22:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:22:48,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:22:48,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:22:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:22:49,291][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:22:49,616][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:22:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:22:50,266][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:22:50,593][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:22:50,917][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:22:51,240][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:22:51,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:22:51,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:22:52,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:22:52,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:22:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:22:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:22:53,510][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:22:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:22:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:22:54,487][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:22:54,811][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:22:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:22:55,461][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:22:55,784][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:22:56,107][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:22:56,432][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:22:56,758][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:22:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:22:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:22:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:22:58,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:22:58,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:59,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:22:59,824][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:22:59,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:22:59,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:23:01,596][__main__][INFO] - Iteration 221 took 24s (38.55% Gen, 54.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 48m 21s. Estimated total time: 20h 7m 33s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 15s. [2025-11-13 09:23:01,598][__main__][INFO] - Starting iteration 221. [2025-11-13 09:23:01,600][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:23:01,601][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:23:10,620][__main__][INFO] - Number of regex retries in iteration 221: 0 [2025-11-13 09:23:10,620][__main__][INFO] - agents played in iteration 221 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:23:11,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:11,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:11,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:11,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:11,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:23:11,139][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:23:11,814][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:23:12,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:23:12,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:23:12,761][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:23:13,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:23:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:23:13,735][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:23:14,059][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:23:14,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:23:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:23:15,031][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:23:15,354][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:23:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:23:16,002][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:23:16,326][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:23:16,650][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:23:16,974][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:23:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:23:17,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:23:17,951][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:23:18,273][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:23:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:23:18,923][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:23:19,245][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:23:19,571][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:23:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:23:20,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:23:20,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:23:20,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:23:21,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:23:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:23:21,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:23:22,174][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:22,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:23:23,618][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:23:23,619][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:23:23,622][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:23:24,575][__main__][INFO] - Iteration 222 took 22s (39.26% Gen, 56.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 11s. Estimated total time: 19h 8m 46s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 27s. [2025-11-13 09:23:24,577][__main__][INFO] - Starting iteration 222. [2025-11-13 09:23:24,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:23:24,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:23:34,328][__main__][INFO] - Number of regex retries in iteration 222: 0 [2025-11-13 09:23:34,329][__main__][INFO] - agents played in iteration 222 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:23:34,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:34,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:34,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:34,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:34,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:23:34,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:23:35,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:23:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:23:36,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:23:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:23:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:23:37,162][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:23:37,486][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:23:37,811][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:23:38,135][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:23:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:23:38,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:23:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:23:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:23:39,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:23:40,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:23:40,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:23:40,734][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:23:41,058][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:23:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:23:41,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:23:42,031][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:23:42,356][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:23:42,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:23:43,004][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:23:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:23:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:23:43,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:23:44,307][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:23:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:23:44,964][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:23:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:23:45,618][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:23:45,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:46,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:23:47,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:23:47,379][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:23:47,380][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:23:48,297][__main__][INFO] - Iteration 223 took 23s (41.10% Gen, 55.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 55s. Estimated total time: 19h 45m 54s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 39s. [2025-11-13 09:23:48,299][__main__][INFO] - Starting iteration 223. [2025-11-13 09:23:48,301][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
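Every iteration ends by persisting three pieces of state: the policy optimizer, the critic optimizer, and the trainer annealing state (the `.pt` files are presumably written with `torch.save`, the `.pkl` with pickle). A minimal stand-in for that per-iteration save step, using `pickle` plus a temp-file rename; the atomic-rename detail is my assumption for crash safety, not something the log shows:

```python
import os
import pickle
import tempfile

def save_state(state: dict, path: str) -> None:
    """Write state to path via a temp file + os.replace so that a crash
    mid-write never leaves a truncated checkpoint behind. Illustrative
    sketch only, not the mllm trainer's actual save routine."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Usage mirroring the log's three save calls would be e.g. `save_state(policy_optimizer_state, ".../agent_trainer/policy_optimizer_state.pt")`, repeated for the critic optimizer and the annealing state.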
[2025-11-13 09:23:48,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:23:58,067][__main__][INFO] - Number of regex retries in iteration 223: 0 [2025-11-13 09:23:58,067][__main__][INFO] - agents played in iteration 223 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:23:58,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:58,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:58,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:58,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:58,607][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:23:58,607][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:23:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:23:59,607][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:23:59,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:24:00,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:24:00,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:24:00,911][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:24:01,242][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:24:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:24:01,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:24:02,224][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:24:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:24:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:24:03,207][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:24:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:24:03,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:24:04,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:24:04,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:24:04,839][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:24:05,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:24:05,494][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:24:05,821][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:24:06,149][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:24:06,473][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:24:06,799][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:24:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:24:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:24:07,777][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:24:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:24:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:24:08,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:24:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:24:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:24:09,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:10,491][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:24:11,178][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:24:11,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:24:11,181][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:24:12,065][__main__][INFO] - Iteration 224 took 23s (41.09% Gen, 55.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 50s. Estimated total time: 19h 48m 13s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 36s, 500 more iterations: 3h 18m 2s. [2025-11-13 09:24:12,067][__main__][INFO] - Starting iteration 224. [2025-11-13 09:24:12,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:24:12,070][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:24:21,084][__main__][INFO] - Number of regex retries in iteration 224: 0 [2025-11-13 09:24:21,085][__main__][INFO] - agents played in iteration 224 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:24:21,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:21,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:21,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:21,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:21,651][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:24:21,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:24:22,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:24:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:24:23,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:24:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:24:23,764][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:24:24,092][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:24:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:24:24,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:24:25,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:24:25,388][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:24:25,713][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:24:26,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:24:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:24:26,687][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:24:27,011][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:24:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:24:27,662][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:24:27,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:24:28,310][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:24:28,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:24:28,959][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:24:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:24:29,616][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:24:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:24:30,266][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:24:30,590][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:24:30,917][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:24:31,242][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:24:31,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:24:31,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:24:32,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:24:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:24:32,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:33,612][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:24:34,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:24:34,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:24:34,332][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:24:35,226][__main__][INFO] - Iteration 225 took 23s (38.93% Gen, 57.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 4s. Estimated total time: 19h 17m 50s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 58s. [2025-11-13 09:24:35,228][__main__][INFO] - Starting iteration 225. [2025-11-13 09:24:35,230][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:24:35,231][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:24:44,693][__main__][INFO] - Number of regex retries in iteration 225: 0 [2025-11-13 09:24:44,694][__main__][INFO] - agents played in iteration 225 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:24:45,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:45,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:45,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:45,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:45,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:24:45,234][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:24:45,960][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:24:46,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:24:46,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:24:46,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:24:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:24:47,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:24:47,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:24:48,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:24:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:24:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:24:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:24:49,510][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:24:49,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:24:50,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:24:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:24:50,810][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:24:51,135][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:24:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:24:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:24:52,110][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:24:52,433][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:24:52,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:24:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:24:53,408][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:24:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:24:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:24:54,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:24:54,710][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:24:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:24:55,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:24:55,685][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:24:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:24:56,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:24:57,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:24:57,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:24:57,770][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:24:57,771][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:24:58,670][__main__][INFO] - Iteration 226 took 23s (40.37% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 10m 52s. Estimated total time: 19h 32m 1s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 20s. [2025-11-13 09:24:58,672][__main__][INFO] - Starting iteration 226. [2025-11-13 09:24:58,674][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:24:58,675][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:08,269][__main__][INFO] - Number of regex retries in iteration 226: 0 [2025-11-13 09:25:08,269][__main__][INFO] - agents played in iteration 226 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:25:08,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,811][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:08,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:25:09,531][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:25:09,827][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:25:10,151][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:25:10,474][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:25:10,799][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:25:11,123][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:25:11,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:25:11,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:25:12,096][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:25:12,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:25:12,744][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:25:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:25:13,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:25:13,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:25:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:25:14,366][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:25:14,690][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:25:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:25:15,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:25:15,663][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:25:15,987][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:25:16,313][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:25:16,637][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:25:16,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:25:17,286][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:25:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:25:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:25:18,260][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:25:18,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:25:18,909][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:25:19,235][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:25:19,561][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:25:19,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:25:20,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:25:21,303][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:25:21,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:25:21,307][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:25:22,277][__main__][INFO] - Iteration 227 took 23s (40.65% Gen, 55.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 36s. Estimated total time: 19h 40m 9s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 41s. [2025-11-13 09:25:22,279][__main__][INFO] - Starting iteration 227. [2025-11-13 09:25:22,282][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:25:22,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:31,235][__main__][INFO] - Number of regex retries in iteration 227: 0 [2025-11-13 09:25:31,235][__main__][INFO] - agents played in iteration 227 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:25:31,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:31,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:31,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:31,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:31,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:31,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:25:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:25:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:25:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:25:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:25:33,770][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:25:34,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:25:34,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:25:34,747][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:25:35,071][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:25:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:25:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:25:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:25:36,371][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:25:36,695][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:25:37,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:25:37,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:25:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:25:37,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:25:38,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:25:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:25:38,968][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:25:39,292][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:25:39,616][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:25:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:25:40,266][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:25:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:25:40,915][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:25:41,239][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:25:41,565][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:25:41,890][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:25:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:25:42,542][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:25:42,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:25:43,585][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:25:44,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:25:44,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:25:44,303][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:25:45,221][__main__][INFO] - Iteration 228 took 22s (39.03% Gen, 56.97% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 45m 2s. Estimated total time: 19h 6m 58s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 9s. [2025-11-13 09:25:45,223][__main__][INFO] - Starting iteration 228. [2025-11-13 09:25:45,225][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:25:45,226][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:54,679][__main__][INFO] - Number of regex retries in iteration 228: 0 [2025-11-13 09:25:54,680][__main__][INFO] - agents played in iteration 228 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:25:55,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:55,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:55,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:55,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:55,232][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:55,232][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:25:55,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:25:56,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:25:56,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:25:56,909][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:25:57,233][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:25:57,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:25:57,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:25:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:25:58,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:25:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:25:59,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:25:59,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:25:59,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:26:00,160][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:26:00,485][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:26:00,808][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:26:01,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:26:01,457][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:26:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:26:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:26:02,432][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:26:02,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:26:03,081][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:26:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:26:03,738][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:26:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:26:04,388][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:26:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:26:05,037][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:26:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:26:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:26:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:26:06,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:26:07,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:26:07,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:26:07,767][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:26:07,769][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:26:08,667][__main__][INFO] - Iteration 229 took 23s (40.33% Gen, 55.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 48s. Estimated total time: 19h 32m 7s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 21s. [2025-11-13 09:26:08,669][__main__][INFO] - Starting iteration 229. [2025-11-13 09:26:08,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:26:08,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:26:17,949][__main__][INFO] - Number of regex retries in iteration 229: 0 [2025-11-13 09:26:17,950][__main__][INFO] - agents played in iteration 229 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:26:18,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:18,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:18,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:18,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:18,490][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:26:18,491][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:26:19,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:26:19,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:26:19,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:26:20,179][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:26:20,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:26:20,834][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:26:21,160][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:26:21,483][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:26:21,810][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:26:22,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:26:22,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:26:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:26:23,109][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:26:23,435][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:26:23,760][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:26:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:26:24,409][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:26:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:26:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:26:25,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:26:25,706][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:26:26,032][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:26:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:26:26,682][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:26:27,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:26:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:26:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:26:27,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:26:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:26:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:26:28,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:26:29,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:26:29,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:26:30,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:26:31,059][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:26:31,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:26:31,062][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:26:31,973][__main__][INFO] - Iteration 230 took 23s (39.81% Gen, 56.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 22s. Estimated total time: 19h 25m 5s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 10s. [2025-11-13 09:26:31,975][__main__][INFO] - Starting iteration 230. [2025-11-13 09:26:31,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 09:26:31,978][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:26:40,991][__main__][INFO] - Number of regex retries in iteration 230: 0 [2025-11-13 09:26:40,992][__main__][INFO] - agents played in iteration 230 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:26:41,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:41,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:41,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:41,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:41,541][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:26:41,541][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:26:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:26:42,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:26:42,883][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:26:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:26:43,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:26:43,857][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:26:44,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:26:44,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:26:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:26:45,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:45,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:45,803][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:46,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:46,451][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:46,776][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:47,101][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:47,425][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:47,749][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:48,073][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:48,398][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:48,724][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:49,049][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:49,374][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:49,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:50,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:50,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:50,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:50,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:51,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:51,976][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:52,304][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:52,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:53,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:54,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:54,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:54,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:55,885][__main__][INFO] - Iteration 231 took 23s (37.70% Gen, 54.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 32m 17s. Estimated total time: 19h 55m 23s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 13s.
[2025-11-13 09:26:55,887][__main__][INFO] - Starting iteration 231.
[2025-11-13 09:26:55,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:26:55,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:05,238][__main__][INFO] - Number of regex retries in iteration 231: 0
[2025-11-13 09:27:05,239][__main__][INFO] - agents played in iteration 231 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:27:05,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,786][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:05,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:06,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:07,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:07,468][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:07,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:08,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:08,444][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:09,092][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:09,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:09,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:10,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:10,406][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:11,065][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:11,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:11,714][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:12,365][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:13,017][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:13,669][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:13,999][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:14,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:14,978][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:15,303][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:15,628][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:15,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:16,279][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:16,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:17,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:27:18,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:27:18,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:27:18,407][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:27:19,415][__main__][INFO] - Iteration 232 took 23s (39.73% Gen, 55.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 46s. Estimated total time: 19h 36m 17s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 2s.
[2025-11-13 09:27:19,417][__main__][INFO] - Starting iteration 232.
[2025-11-13 09:27:19,420][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:27:19,421][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:28,748][__main__][INFO] - Number of regex retries in iteration 232: 0
[2025-11-13 09:27:28,749][__main__][INFO] - agents played in iteration 232 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:27:29,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:29,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:29,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:29,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:29,277][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:29,277][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:29,997][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:30,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:31,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:31,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:33,224][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:33,549][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:33,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:34,207][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:34,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:34,859][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:35,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:35,515][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:35,840][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:36,493][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:36,818][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:37,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:37,467][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:37,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:38,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:38,768][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:40,397][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:41,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:27:41,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:27:41,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:27:41,812][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:27:42,787][__main__][INFO] - Iteration 233 took 23s (39.92% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 28s. Estimated total time: 19h 28m 22s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 43s.
[2025-11-13 09:27:42,789][__main__][INFO] - Starting iteration 233.
[2025-11-13 09:27:42,792][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:27:42,792][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:51,868][__main__][INFO] - Number of regex retries in iteration 233: 0
[2025-11-13 09:27:51,869][__main__][INFO] - agents played in iteration 233 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:27:52,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:52,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:52,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:52,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:52,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:52,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:53,396][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:54,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:54,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:55,355][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:55,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:56,005][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:56,331][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:56,658][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:56,984][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:57,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:57,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:57,960][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:58,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:58,940][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:59,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:59,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:59,922][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:00,247][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:00,575][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:00,900][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:01,550][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:01,877][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:02,202][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:02,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:03,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:04,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:04,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:04,908][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:04,910][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:05,895][__main__][INFO] - Iteration 234 took 23s (39.28% Gen, 56.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 56s. Estimated total time: 19h 15m 13s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 32s.
[2025-11-13 09:28:05,898][__main__][INFO] - Starting iteration 234.
[2025-11-13 09:28:05,901][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:05,902][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:28:15,470][__main__][INFO] - Number of regex retries in iteration 234: 0
[2025-11-13 09:28:15,471][__main__][INFO] - agents played in iteration 234 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:28:15,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:15,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:15,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:15,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:15,993][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:28:15,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:28:16,722][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:28:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:28:17,345][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:28:17,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:28:18,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:28:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:28:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:28:18,988][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:28:19,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:28:19,642][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:28:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:28:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:28:20,621][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:28:20,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:28:21,272][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:28:21,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:28:21,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:22,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:22,571][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:22,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:23,224][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:28:23,549][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:23,875][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:24,855][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:25,181][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:25,507][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:25,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:26,162][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:26,813][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:27,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:27,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:28,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:28,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:28,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:29,522][__main__][INFO] - Iteration 235 took 23s (40.51% Gen, 55.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 16m 24s. Estimated total time: 19h 41m 5s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 50s.
[2025-11-13 09:28:29,524][__main__][INFO] - Starting iteration 235.
[2025-11-13 09:28:29,528][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:29,529][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:28:38,251][__main__][INFO] - Number of regex retries in iteration 235: 0
[2025-11-13 09:28:38,251][__main__][INFO] - agents played in iteration 235 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:28:38,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:38,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:38,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:38,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:38,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:28:38,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:28:39,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:28:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:28:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:28:40,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:28:40,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:28:41,103][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:28:41,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:28:41,754][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:28:42,084][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:28:42,415][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:28:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:28:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:28:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:28:43,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:28:44,041][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:28:44,366][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:28:44,691][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:45,341][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:45,668][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:28:46,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:46,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:46,969][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:47,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:47,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:48,273][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:48,598][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:48,926][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:49,250][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:49,576][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:49,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:50,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:51,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:51,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:51,319][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:52,300][__main__][INFO] - Iteration 236 took 22s (38.30% Gen, 57.39% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 33m 34s. Estimated total time: 18h 58m 37s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 46s.
[2025-11-13 09:28:52,302][__main__][INFO] - Starting iteration 236.
[2025-11-13 09:28:52,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:52,306][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:01,697][__main__][INFO] - Number of regex retries in iteration 236: 0
[2025-11-13 09:29:01,697][__main__][INFO] - agents played in iteration 236 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:29:02,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:02,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:02,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:02,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:02,232][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:02,233][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:02,957][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:03,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:03,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:04,237][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:04,894][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:05,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:06,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:06,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:07,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:07,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:08,164][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:08,490][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:08,816][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:09,142][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:09,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:09,792][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:10,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:10,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:11,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:11,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:12,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:12,740][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:13,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:14,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:14,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:14,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:14,817][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:15,782][__main__][INFO] - Iteration 237 took 23s (40.00% Gen, 55.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 28s. Estimated total time: 19h 33m 54s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 39s.
[2025-11-13 09:29:15,784][__main__][INFO] - Starting iteration 237.
[2025-11-13 09:29:15,788][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:29:15,788][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:24,758][__main__][INFO] - Number of regex retries in iteration 237: 0
[2025-11-13 09:29:24,758][__main__][INFO] - agents played in iteration 237 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:29:25,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:25,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:25,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:25,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:25,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:25,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:26,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:26,302][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:26,629][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:26,957][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:27,939][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:28,915][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:29,566][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:31,192][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:31,516][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:32,168][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:32,492][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:32,816][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:33,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:33,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:34,120][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:34,444][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:35,419][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:36,070][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:36,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:37,102][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:37,820][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:37,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:37,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:38,786][__main__][INFO] - Iteration 238 took 22s (39.00% Gen, 56.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 8s. Estimated total time: 19h 9m 57s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 39s.
[2025-11-13 09:29:38,788][__main__][INFO] - Starting iteration 238.
[2025-11-13 09:29:38,792][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:29:38,792][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:47,547][__main__][INFO] - Number of regex retries in iteration 238: 0
[2025-11-13 09:29:47,548][__main__][INFO] - agents played in iteration 238 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:29:47,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:48,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:48,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:48,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:48,073][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:48,074][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:48,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:49,097][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:50,077][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:50,733][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:51,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:51,387][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:52,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:52,361][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:52,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:53,010][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:53,335][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:53,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:53,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:54,311][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:54,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:54,963][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:55,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:55,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:55,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:56,261][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:56,586][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:57,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:57,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:58,539][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:58,863][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:59,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:59,946][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:00,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:00,678][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:00,679][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:01,669][__main__][INFO] - Iteration 239 took 22s (38.27% Gen, 57.40% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 40s. Estimated total time: 19h 3m 53s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 38s.
[2025-11-13 09:30:01,671][__main__][INFO] - Starting iteration 239.
[2025-11-13 09:30:01,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:01,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:11,080][__main__][INFO] - Number of regex retries in iteration 239: 0
[2025-11-13 09:30:11,081][__main__][INFO] - agents played in iteration 239 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:30:11,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:11,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:11,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:11,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:11,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:11,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:12,649][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:12,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:13,308][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:13,963][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:14,289][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:14,938][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:15,264][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:15,589][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:15,914][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:16,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:16,565][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:16,891][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:17,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:17,869][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:18,196][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:18,522][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:18,849][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:19,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:19,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:19,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:20,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:21,130][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:21,456][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:21,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:22,432][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:22,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:23,442][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:24,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:24,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:24,169][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:25,133][__main__][INFO] - Iteration 240 took 23s (40.09% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 6m 17s. Estimated total time: 19h 32m 53s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 28s.
[2025-11-13 09:30:25,135][__main__][INFO] - Starting iteration 240.
[2025-11-13 09:30:25,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:25,139][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:33,530][__main__][INFO] - Number of regex retries in iteration 240: 0
[2025-11-13 09:30:33,531][__main__][INFO] - agents played in iteration 240 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:30:33,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:34,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:34,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:34,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:34,070][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:34,070][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:34,782][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:35,079][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:35,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:36,057][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:36,707][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:37,033][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:37,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:37,686][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:38,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:38,339][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:38,665][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:38,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:39,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:39,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:39,968][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:40,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:40,621][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:40,948][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:41,275][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:41,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:41,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:42,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:42,574][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:42,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:43,556][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:43,882][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:44,207][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:44,533][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:44,857][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:45,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:45,865][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:46,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:46,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:46,614][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:48,539][__main__][INFO] - Iteration 241 took 23s (35.86% Gen, 55.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 4s. Estimated total time: 19h 30m 3s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 0s.
[2025-11-13 09:30:48,541][__main__][INFO] - Starting iteration 241.
[2025-11-13 09:30:48,544][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:30:48,544][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:58,201][__main__][INFO] - Number of regex retries in iteration 241: 0
[2025-11-13 09:30:58,202][__main__][INFO] - agents played in iteration 241 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:30:58,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:58,744][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:58,744][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:00,103][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:00,429][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:00,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:01,083][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:01,410][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:01,737][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:02,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:03,369][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:03,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:04,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:04,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:05,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:05,329][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:05,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:05,982][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:06,306][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:06,630][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:06,955][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:07,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:07,602][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:08,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:08,906][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:09,234][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:09,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
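The "Processing mini-batch k of 128" cadence above (logged every 4 mini-batches, ending with one accumulated loss over 3840 tokens) is the standard gradient-accumulation pattern: per-token losses from many mini-batches are summed before a single optimizer step. A framework-free sketch of that control flow, assuming a REINFORCE-style surrogate loss; the names are illustrative, not the trainer's actual API:

```python
def accumulate_policy_loss(minibatches, log_every=4):
    """Accumulate a scalar policy-gradient loss over mini-batches.

    Each mini-batch is a list of (logprob, advantage) pairs, one per
    trained token. Returns the summed loss and total token count,
    mirroring the 'Accumulated the policy gradient loss for N tokens' log.
    """
    total_loss, n_tokens = 0.0, 0
    for i, batch in enumerate(minibatches):
        if i % log_every == 0:
            print(f"Processing mini-batch {i} of {len(minibatches)}")
        # REINFORCE surrogate: -logprob * advantage, summed over tokens.
        total_loss += sum(-lp * adv for lp, adv in batch)
        n_tokens += len(batch)
    print(f"Accumulated the policy gradient loss for {n_tokens} tokens.")
    return total_loss, n_tokens
```

128 mini-batches of 30 trained tokens each would report 3840 tokens, matching the log; in the real trainer, each mini-batch presumably calls `loss.backward()` to accumulate gradients rather than summing scalars.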
[2025-11-13 09:31:10,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:31:11,330][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:31:11,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:31:11,333][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:31:12,302][__main__][INFO] - Iteration 242 took 23s (40.65% Gen, 55.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 32s. Estimated total time: 19h 47m 55s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 59s.
[2025-11-13 09:31:12,304][__main__][INFO] - Starting iteration 242.
[2025-11-13 09:31:12,307][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
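The per-iteration summaries ("Iteration 242 took 23s ... Estimated remaining time ...") can be produced from a running average of completed iteration durations. A minimal sketch, assuming a simple mean; the real estimator may weight or smooth differently:

```python
def eta_summary(durations, total_iterations):
    """Estimate remaining and total wall time from iteration durations (seconds)."""
    done = len(durations)
    avg = sum(durations) / done

    def fmt(seconds):
        # Render as "Hh Mm Ss", matching the log's style.
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s"

    return {
        "avg_iteration_s": avg,
        "remaining": fmt(avg * (total_iterations - done)),
        "total": fmt(avg * total_iterations),
    }
```

For example, after 243 iterations averaging 23s each with 3000 planned, the remaining estimate is 23 × 2757 s ≈ 17h 36m, in the same ballpark as the logged figures.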
[2025-11-13 09:31:12,307][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:31:21,500][__main__][INFO] - Number of regex retries in iteration 242: 0
[2025-11-13 09:31:21,500][__main__][INFO] - agents played in iteration 242 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:31:21,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:21,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:22,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:22,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:22,052][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:31:22,052][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:31:22,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:31:23,083][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:23,737][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:24,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:24,712][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:25,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:25,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:26,017][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:26,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:26,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:26,993][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:27,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:27,968][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:28,291][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:28,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:28,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:29,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:29,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:29,918][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:30,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:30,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:30,892][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:31,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:31,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:31,865][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:32,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:32,838][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:33,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:31:33,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:31:34,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:31:34,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:31:34,553][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:31:35,508][__main__][INFO] - Iteration 243 took 23s (39.62% Gen, 56.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 19s. Estimated total time: 19h 20m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 20s.
[2025-11-13 09:31:35,510][__main__][INFO] - Starting iteration 243.
[2025-11-13 09:31:35,514][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
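After every optimizer step the trainer persists the policy/critic optimizer states (`.pt`) and an annealing state (`.pkl`), so a preempted run can resume from the last iteration. A generic sketch of such per-step checkpointing using `pickle` with an atomic rename; the actual `.pt` files are presumably written with `torch.save` on optimizer `state_dict()`s, and `save_trainer_state` is a hypothetical name:

```python
import pickle
from pathlib import Path

def save_trainer_state(state, path):
    """Persist a state object to disk, creating parent directories.

    Writes to a temporary file first, then renames it over the target,
    so a crash mid-write never leaves a torn checkpoint behind.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    tmp.replace(path)  # atomic on POSIX filesystems
    return path
```

The atomic-rename step is a common hardening choice for checkpoints saved once per iteration; whether this trainer does it is not visible from the log.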
[2025-11-13 09:31:35,515][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:31:44,969][__main__][INFO] - Number of regex retries in iteration 243: 0
[2025-11-13 09:31:44,970][__main__][INFO] - agents played in iteration 243 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:31:45,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:45,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:31:45,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:31:46,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:31:46,551][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:46,878][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:47,855][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:48,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:48,508][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:49,169][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:51,136][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:51,797][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:52,121][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:52,446][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:53,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:53,751][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:54,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:54,726][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:55,050][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:55,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:55,698][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:56,349][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:56,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:31:57,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:31:58,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:31:58,088][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:31:58,090][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:31:59,054][__main__][INFO] - Iteration 244 took 23s (40.17% Gen, 55.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 53s. Estimated total time: 19h 37m 2s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 10s.
[2025-11-13 09:31:59,056][__main__][INFO] - Starting iteration 244.
[2025-11-13 09:31:59,060][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:31:59,060][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:32:07,720][__main__][INFO] - Number of regex retries in iteration 244: 0
[2025-11-13 09:32:07,721][__main__][INFO] - agents played in iteration 244 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:32:08,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:08,620][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:32:08,620][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:32:09,356][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:32:09,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:32:09,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:32:10,305][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:32:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:32:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:32:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:32:11,612][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:32:11,936][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:32:12,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:32:12,588][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:32:12,913][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:32:13,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:32:13,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:32:13,891][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:32:14,220][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:32:14,545][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:32:14,870][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:32:15,197][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:32:15,525][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:32:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:32:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:32:16,500][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:32:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:32:17,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:32:17,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:32:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:32:18,124][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:32:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:32:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:32:19,099][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:32:19,423][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:32:19,747][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:32:20,437][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:32:21,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:32:21,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:32:21,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:32:22,134][__main__][INFO] - Iteration 245 took 23s (37.53% Gen, 58.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 45m 13s. Estimated total time: 19h 13m 46s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 17s.
[2025-11-13 09:32:22,136][__main__][INFO] - Starting iteration 245.
[2025-11-13 09:32:22,140][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:32:22,140][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:32:31,507][__main__][INFO] - Number of regex retries in iteration 245: 0
[2025-11-13 09:32:31,507][__main__][INFO] - agents played in iteration 245 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:32:31,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:31,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:32,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:32,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:32,050][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:32:32,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:32:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:32:33,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:32:33,407][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:32:33,733][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:32:34,058][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:32:34,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:32:34,710][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:32:35,035][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:32:35,359][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:32:35,684][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:32:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:32:36,332][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:32:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:32:36,983][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:32:37,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:32:37,641][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:32:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:32:38,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:32:38,617][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:32:38,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:32:39,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:32:39,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:32:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:32:40,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:32:40,564][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:32:40,888][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:32:41,212][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:32:41,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:32:41,865][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:32:42,190][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:32:42,515][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:32:42,840][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:32:43,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:32:43,870][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:32:44,600][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:32:44,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:32:44,604][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:32:45,567][__main__][INFO] - Iteration 246 took 23s (39.98% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 28s. Estimated total time: 19h 31m 25s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 14s.
[2025-11-13 09:32:45,569][__main__][INFO] - Starting iteration 246.
[2025-11-13 09:32:45,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:32:45,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:32:54,328][__main__][INFO] - Number of regex retries in iteration 246: 0
[2025-11-13 09:32:54,328][__main__][INFO] - agents played in iteration 246 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:32:54,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:54,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:54,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:54,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:32:54,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:32:54,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:32:55,608][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:32:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:32:56,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:32:56,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:32:56,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:32:57,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:32:57,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:32:57,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:32:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:32:58,514][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:32:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:32:59,167][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:32:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:32:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:00,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:00,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:00,803][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:01,129][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:01,453][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:01,779][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:02,754][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:03,078][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:03,403][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:03,727][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:04,052][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:04,376][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:04,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:05,025][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:05,350][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:05,674][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:06,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:06,684][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:07,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:07,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:07,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:08,389][__main__][INFO] - Iteration 247 took 22s (38.37% Gen, 57.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 31m 30s. Estimated total time: 19h 0m 49s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 1s, 500 more iterations: 3h 10m 8s.
[2025-11-13 09:33:08,391][__main__][INFO] - Starting iteration 247.
[2025-11-13 09:33:08,395][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:08,395][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:33:17,127][__main__][INFO] - Number of regex retries in iteration 247: 0
[2025-11-13 09:33:17,127][__main__][INFO] - agents played in iteration 247 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:33:17,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:17,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:17,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:17,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:17,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:33:17,666][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:33:18,409][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:33:18,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:33:19,033][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:33:19,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:33:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:33:20,009][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:33:20,337][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:33:20,663][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:33:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:33:21,315][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:33:21,640][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:21,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:22,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:22,617][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:22,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:23,593][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:23,919][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:24,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:24,568][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:24,892][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:25,217][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:26,515][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:27,163][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:27,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:28,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:28,786][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:29,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:30,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:30,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:30,313][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:31,298][__main__][INFO] - Iteration 248 took 22s (38.12% Gen, 57.57% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 35m 30s. Estimated total time: 19h 5m 12s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 52s.
[2025-11-13 09:33:31,300][__main__][INFO] - Starting iteration 248.
[2025-11-13 09:33:31,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:31,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:33:39,711][__main__][INFO] - Number of regex retries in iteration 248: 0
[2025-11-13 09:33:39,712][__main__][INFO] - agents played in iteration 248 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:33:40,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:40,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:40,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:40,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:40,246][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:33:40,246][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:33:40,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:33:41,272][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:33:41,598][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:33:41,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:33:42,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:33:42,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:33:42,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:33:43,230][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:33:43,557][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:33:43,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:33:44,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:44,535][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:44,865][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:45,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:45,842][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:46,171][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:46,497][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:47,146][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:47,796][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:48,120][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:48,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:50,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:50,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:50,725][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:51,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:51,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:52,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:52,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:52,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:52,785][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:53,778][__main__][INFO] - Iteration 249 took 22s (37.41% Gen, 58.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 13m 42s. Estimated total time: 18h 43m 47s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 27s, 500 more iterations: 3h 7m 17s.
[2025-11-13 09:33:53,780][__main__][INFO] - Starting iteration 249.
[2025-11-13 09:33:53,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:53,784][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:02,738][__main__][INFO] - Number of regex retries in iteration 249: 0
[2025-11-13 09:34:02,739][__main__][INFO] - agents played in iteration 249 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:34:03,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:03,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:03,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:03,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:03,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:03,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:04,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:04,975][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:05,950][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:06,278][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:06,609][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:06,934][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:07,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:07,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:07,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:08,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:08,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:08,887][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:09,217][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:09,541][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:10,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:11,166][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:11,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:11,815][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:12,464][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:12,789][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:14,089][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:14,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:15,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:15,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:15,842][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:15,843][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:16,817][__main__][INFO] - Iteration 250 took 23s (38.87% Gen, 56.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 41m 15s. Estimated total time: 19h 11m 42s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 57s.
[2025-11-13 09:34:16,819][__main__][INFO] - Starting iteration 250.
[2025-11-13 09:34:16,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:34:16,823][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:25,726][__main__][INFO] - Number of regex retries in iteration 250: 0
[2025-11-13 09:34:25,726][__main__][INFO] - agents played in iteration 250 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:34:26,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:26,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:26,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:26,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:26,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:26,266][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:26,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:27,285][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:28,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:28,594][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:28,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:29,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:29,575][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:29,900][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:30,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:31,213][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:31,543][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:31,867][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:32,191][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:32,516][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:32,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:33,168][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:33,493][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:34,791][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:35,115][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:36,088][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:36,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:36,738][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:37,064][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:37,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:38,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:38,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:38,786][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:38,787][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:40,716][__main__][INFO] - Iteration 251 took 23s (37.26% Gen, 54.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 51s. Estimated total time: 19h 54m 42s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 7s.
[2025-11-13 09:34:40,718][__main__][INFO] - Starting iteration 251.
[2025-11-13 09:34:40,721][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:34:40,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:50,478][__main__][INFO] - Number of regex retries in iteration 251: 0
[2025-11-13 09:34:50,479][__main__][INFO] - agents played in iteration 251 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:34:50,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:50,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:50,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:51,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:51,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:51,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:51,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:52,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:52,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:53,036][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:53,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:53,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:54,015][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:54,339][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:54,663][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:55,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:55,972][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:56,297][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:56,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:56,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:57,273][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:57,598][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:57,924][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:58,248][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:58,573][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:58,898][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:59,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:59,551][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:59,879][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:00,854][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:01,179][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:01,505][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:01,830][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:02,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:02,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:35:03,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:03,603][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:03,605][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:04,597][__main__][INFO] - Iteration 252 took 23s (40.86% Gen, 54.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 36s. Estimated total time: 19h 53m 51s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 58s.
[2025-11-13 09:35:04,600][__main__][INFO] - Starting iteration 252.
[2025-11-13 09:35:04,602][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:35:04,603][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:35:14,108][__main__][INFO] - Number of regex retries in iteration 252: 0
[2025-11-13 09:35:14,108][__main__][INFO] - agents played in iteration 252 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:35:14,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:14,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:14,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:14,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:14,663][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:35:14,663][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:35:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:35:15,686][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:35:16,012][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:35:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:35:16,663][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:35:16,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:35:17,312][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:35:17,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:35:17,966][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:35:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:18,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:18,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:19,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:19,613][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:19,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:20,267][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:20,594][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:20,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:21,573][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:22,555][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:22,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:23,206][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:23,533][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:23,860][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:24,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:24,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:24,833][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:25,158][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:25,484][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:25,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:26,477][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:35:27,213][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:35:27,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:35:27,216][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:35:28,226][__main__][INFO] - Iteration 253 took 23s (40.23% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 34s. Estimated total time: 19h 41m 13s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 52s. [2025-11-13 09:35:28,228][__main__][INFO] - Starting iteration 253. [2025-11-13 09:35:28,231][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:35:28,232][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:35:37,590][__main__][INFO] - Number of regex retries in iteration 253: 0 [2025-11-13 09:35:37,591][__main__][INFO] - agents played in iteration 253 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:35:38,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:38,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:38,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:38,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:38,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:35:38,129][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:35:38,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:35:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:35:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:35:39,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:35:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:35:40,455][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:35:40,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:35:41,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:35:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:35:41,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:35:42,082][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:35:42,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:35:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:35:43,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:35:43,390][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:35:43,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:35:44,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:35:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:35:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:35:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:35:45,342][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:35:45,668][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:35:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:35:46,318][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:35:46,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:35:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:35:47,289][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:35:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:35:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:35:48,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:35:48,587][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:35:48,911][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:35:49,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:35:49,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:35:50,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:35:50,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:35:50,650][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:35:51,625][__main__][INFO] - Iteration 254 took 23s (40.00% Gen, 55.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 42s. Estimated total time: 19h 29m 44s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 57s. [2025-11-13 09:35:51,628][__main__][INFO] - Starting iteration 254. [2025-11-13 09:35:51,631][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:35:51,631][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:00,368][__main__][INFO] - Number of regex retries in iteration 254: 0 [2025-11-13 09:36:00,369][__main__][INFO] - agents played in iteration 254 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:36:00,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:00,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:00,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:00,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:00,910][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:00,911][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:01,937][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:02,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:03,898][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:04,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:04,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:04,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:05,203][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:05,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:05,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:06,509][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:06,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:07,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:07,824][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:08,150][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:36:08,480][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:09,136][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:09,466][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:10,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:10,452][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:10,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:11,106][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:11,432][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:12,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:36:12,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:13,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:13,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:13,525][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:14,502][__main__][INFO] - Iteration 255 took 22s (38.20% Gen, 57.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 31m 9s. Estimated total time: 19h 3m 35s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 35s. [2025-11-13 09:36:14,504][__main__][INFO] - Starting iteration 255. [2025-11-13 09:36:14,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:36:14,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:23,183][__main__][INFO] - Number of regex retries in iteration 255: 0 [2025-11-13 09:36:23,183][__main__][INFO] - agents played in iteration 255 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:36:23,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:23,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:23,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:23,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:23,725][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:23,726][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:24,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:25,395][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:25,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:26,372][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:27,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:28,008][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:28,335][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:28,661][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:28,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:29,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:29,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:30,297][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:30,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:30,946][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:36:31,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:32,255][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:32,581][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:33,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:33,558][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:33,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:34,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:34,532][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:34,857][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:36:35,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:36,252][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:36,254][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:36,255][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:37,209][__main__][INFO] - Iteration 256 took 22s (38.22% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 22m 21s. Estimated total time: 18h 55m 9s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 11s. [2025-11-13 09:36:37,211][__main__][INFO] - Starting iteration 256. [2025-11-13 09:36:37,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:36:37,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:46,270][__main__][INFO] - Number of regex retries in iteration 256: 0 [2025-11-13 09:36:46,271][__main__][INFO] - agents played in iteration 256 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:36:46,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:46,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:46,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:46,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:46,814][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:46,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:36:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:36:48,163][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:36:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:36:48,817][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:36:49,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:36:49,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:36:49,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:36:50,127][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:36:50,454][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:36:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:36:51,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:36:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:36:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:36:52,084][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:36:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:36:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:36:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:36:53,392][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:36:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:36:54,042][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:36:54,369][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:36:54,693][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:36:55,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:36:55,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:36:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:36:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:36:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:36:56,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:36:56,969][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:36:57,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:36:57,621][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:36:57,946][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:36:58,615][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:59,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:59,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:59,346][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:00,307][__main__][INFO] - Iteration 257 took 23s (39.22% Gen, 56.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 27s. Estimated total time: 19h 14m 39s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 26s. [2025-11-13 09:37:00,309][__main__][INFO] - Starting iteration 257. [2025-11-13 09:37:00,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:37:00,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:37:08,890][__main__][INFO] - Number of regex retries in iteration 257: 0 [2025-11-13 09:37:08,891][__main__][INFO] - agents played in iteration 257 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:37:09,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:09,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:09,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:09,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:09,435][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:37:09,435][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:37:10,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:37:10,455][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:37:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:37:11,106][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:37:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:37:11,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:37:12,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:37:12,406][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:37:12,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:37:13,057][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:37:13,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:37:13,709][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:37:14,034][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:37:14,357][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:37:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:37:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:37:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:37:15,659][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:37:15,987][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:37:16,313][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:37:16,640][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:37:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:37:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:37:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:37:17,941][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:37:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:37:18,589][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:37:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:37:19,240][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:37:19,562][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:37:19,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:37:20,212][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:37:20,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:37:21,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:21,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:21,919][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:21,921][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:22,899][__main__][INFO] - Iteration 258 took 22s (37.98% Gen, 57.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 50s. Estimated total time: 18h 49m 23s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 13s. [2025-11-13 09:37:22,901][__main__][INFO] - Starting iteration 258. [2025-11-13 09:37:22,905][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:37:22,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:37:31,457][__main__][INFO] - Number of regex retries in iteration 258: 0
[2025-11-13 09:37:31,458][__main__][INFO] - agents played in iteration 258 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:37:31,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:31,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:31,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:31,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:32,000][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:37:32,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:37:32,729][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:37:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:37:33,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:37:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:37:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:37:34,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:37:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:37:34,978][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:37:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:37:35,634][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:37:35,964][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:37:36,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:37:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:37:36,943][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:37:37,268][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:37:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:37:37,916][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:37:38,241][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:37:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:37:38,900][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:37:39,230][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:37:39,563][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:37:39,892][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:37:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:37:40,546][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:37:40,873][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:37:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:37:41,530][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:37:41,860][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:37:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:37:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:37:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:37:43,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:37:43,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:37:44,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:37:44,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:37:44,555][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:37:45,517][__main__][INFO] - Iteration 259 took 22s (37.82% Gen, 57.92% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 41s. Estimated total time: 18h 50m 37s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 41s, 500 more iterations: 3h 8m 26s.
[2025-11-13 09:37:45,519][__main__][INFO] - Starting iteration 259.
[2025-11-13 09:37:45,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:37:45,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:37:54,328][__main__][INFO] - Number of regex retries in iteration 259: 0
[2025-11-13 09:37:54,329][__main__][INFO] - agents played in iteration 259 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:37:54,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:54,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:54,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:54,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:54,869][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:37:54,870][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:37:55,600][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:37:55,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:37:56,222][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:37:56,547][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:37:56,872][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:37:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:37:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:37:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:37:58,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:37:58,500][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:37:58,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:37:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:37:59,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:37:59,808][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:00,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:00,782][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:01,107][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:01,432][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:01,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:02,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:02,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:02,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:03,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:03,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:04,038][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:04,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:05,011][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:05,659][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:05,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:06,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:07,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:07,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:07,379][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:08,353][__main__][INFO] - Iteration 260 took 22s (38.57% Gen, 57.15% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 17s. Estimated total time: 19h 1m 36s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 16s.
[2025-11-13 09:38:08,355][__main__][INFO] - Starting iteration 260.
[2025-11-13 09:38:08,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:38:08,359][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:16,883][__main__][INFO] - Number of regex retries in iteration 260: 0
[2025-11-13 09:38:16,884][__main__][INFO] - agents played in iteration 260 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:38:17,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:17,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:17,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:17,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:17,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:17,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:18,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:18,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:19,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:19,746][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:20,071][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:20,398][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:21,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:21,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:22,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:22,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:23,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:23,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:24,309][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:24,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:24,963][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:25,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:25,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:26,264][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:26,590][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:26,914][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:27,239][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:27,890][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:28,214][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:28,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:29,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:29,956][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:29,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:29,959][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:31,852][__main__][INFO] - Iteration 261 took 23s (36.28% Gen, 55.65% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 0s. Estimated total time: 19h 34m 42s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 47s.
[2025-11-13 09:38:31,854][__main__][INFO] - Starting iteration 261.
[2025-11-13 09:38:31,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:38:31,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:40,675][__main__][INFO] - Number of regex retries in iteration 261: 0
[2025-11-13 09:38:40,676][__main__][INFO] - agents played in iteration 261 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:38:41,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:41,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:41,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:41,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:41,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:41,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:41,942][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:42,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:43,213][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:43,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:43,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:44,191][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:44,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:44,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:45,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:45,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:46,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:46,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:46,802][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:47,126][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:47,451][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:48,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:48,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:49,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:49,733][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:50,059][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:50,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:50,711][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:51,041][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:51,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:52,021][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:52,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:53,012][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:53,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:53,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:53,742][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:54,731][__main__][INFO] - Iteration 262 took 22s (38.55% Gen, 57.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 36s. Estimated total time: 19h 3m 42s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 37s.
[2025-11-13 09:38:54,733][__main__][INFO] - Starting iteration 262.
[2025-11-13 09:38:54,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:38:54,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:03,654][__main__][INFO] - Number of regex retries in iteration 262: 0
[2025-11-13 09:39:03,655][__main__][INFO] - agents played in iteration 262 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:39:04,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:04,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:04,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:04,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:04,210][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:04,211][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:04,939][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:05,234][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:05,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:05,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:06,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:06,536][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:07,187][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:07,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:08,165][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:08,490][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:08,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:09,141][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:09,466][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:09,792][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:10,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:10,767][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:11,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:11,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:11,748][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:12,398][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:12,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:13,054][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:13,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:13,706][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:14,036][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:14,360][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:15,008][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:15,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:16,001][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:16,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:16,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:16,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:17,703][__main__][INFO] - Iteration 263 took 22s (38.83% Gen, 56.97% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 32m 52s. Estimated total time: 19h 8m 20s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 23s.
[2025-11-13 09:39:17,705][__main__][INFO] - Starting iteration 263.
[2025-11-13 09:39:17,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:17,708][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:26,486][__main__][INFO] - Number of regex retries in iteration 263: 0
[2025-11-13 09:39:26,487][__main__][INFO] - agents played in iteration 263 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:39:26,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:26,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:27,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:27,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:27,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:27,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:27,776][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:28,072][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:28,723][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:29,373][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:29,698][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:30,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:30,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:30,673][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:30,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:31,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:31,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:32,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:32,628][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:32,957][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:33,283][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:33,608][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:34,261][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:34,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:34,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:35,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:35,574][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:35,901][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:36,228][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:36,553][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:36,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:37,211][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:37,535][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:37,861][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:38,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:38,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:39:39,612][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:39:39,613][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:39:39,615][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:39:40,584][__main__][INFO] - Iteration 264 took 22s (38.37% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 0s. Estimated total time: 19h 3m 51s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 38s. [2025-11-13 09:39:40,586][__main__][INFO] - Starting iteration 264. [2025-11-13 09:39:40,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:39:40,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:39:48,888][__main__][INFO] - Number of regex retries in iteration 264: 0 [2025-11-13 09:39:48,889][__main__][INFO] - agents played in iteration 264 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:39:49,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:49,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:49,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:49,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:49,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:39:49,438][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:39:50,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:39:50,470][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:39:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:39:51,124][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:39:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:39:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:39:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:39:52,426][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:39:52,751][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:39:53,076][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:39:53,402][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:39:53,727][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:39:54,052][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:39:54,378][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:39:54,703][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:39:55,028][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:39:55,354][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:39:55,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:39:56,003][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:39:56,328][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:39:56,657][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:39:56,982][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:39:57,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:39:57,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:39:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:39:58,295][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:39:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:39:58,945][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:39:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:39:59,597][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:39:59,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:00,248][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:00,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:01,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:02,000][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:02,002][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:02,003][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:02,971][__main__][INFO] - Iteration 265 took 22s (37.08% Gen, 58.60% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 2m 53s. Estimated total time: 18h 39m 6s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 18s, 500 more iterations: 3h 6m 31s. [2025-11-13 09:40:02,973][__main__][INFO] - Starting iteration 265. [2025-11-13 09:40:02,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. [2025-11-13 09:40:02,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:10,287][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . 
did not match regex: (|), retry 1/1 [2025-11-13 09:40:11,815][__main__][INFO] - Number of regex retries in iteration 265: 1 [2025-11-13 09:40:11,816][__main__][INFO] - agents played in iteration 265 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:40:12,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:12,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:12,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:12,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:12,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:12,372][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:13,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:14,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:14,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:15,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:16,009][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:16,335][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:16,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:17,313][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:17,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:17,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:18,289][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:18,940][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:19,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:40:19,594][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:40:19,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:20,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:20,572][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:20,903][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:21,229][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:21,880][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:22,209][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:22,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:23,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:24,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:24,928][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:24,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:24,931][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:25,883][__main__][INFO] - Iteration 266 took 22s (38.58% Gen, 57.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 47s. Estimated total time: 19h 5m 23s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 53s. [2025-11-13 09:40:25,885][__main__][INFO] - Starting iteration 266. [2025-11-13 09:40:25,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:25,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:34,731][__main__][INFO] - Number of regex retries in iteration 266: 0 [2025-11-13 09:40:34,731][__main__][INFO] - agents played in iteration 266 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:40:35,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:35,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:35,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:35,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:35,286][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:35,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:36,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:36,647][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:36,972][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:37,297][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:37,622][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:37,948][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:38,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:38,924][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:39,249][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:39,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:39,901][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:40,228][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:40,554][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:40,880][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:41,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:40:42,509][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:40:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:43,484][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:43,810][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:44,137][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:44,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:44,790][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:45,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:45,446][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:45,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:46,102][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:46,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:47,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:47,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:47,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:47,872][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:48,823][__main__][INFO] - Iteration 267 took 22s (38.55% Gen, 57.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 45s. Estimated total time: 19h 6m 45s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 7s. [2025-11-13 09:40:48,825][__main__][INFO] - Starting iteration 267. [2025-11-13 09:40:48,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:48,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:57,732][__main__][INFO] - Number of regex retries in iteration 267: 0 [2025-11-13 09:40:57,733][__main__][INFO] - agents played in iteration 267 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:40:58,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:58,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:58,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:58,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:58,286][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:58,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:59,324][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:59,977][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:00,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:00,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:00,955][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:01,279][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:01,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:01,929][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:02,255][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:02,580][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:02,907][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:03,232][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:03,557][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:03,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:04,209][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:04,535][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:04,864][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:41:05,515][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:41:05,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:06,170][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:06,823][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:07,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:07,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:08,133][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:08,458][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:08,784][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:09,110][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:09,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:41:10,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:10,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:10,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:10,869][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:11,816][__main__][INFO] - Iteration 268 took 22s (38.73% Gen, 57.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 32m 1s. Estimated total time: 19h 9m 24s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 34s. [2025-11-13 09:41:11,818][__main__][INFO] - Starting iteration 268. [2025-11-13 09:41:11,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:41:11,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:20,285][__main__][INFO] - Number of regex retries in iteration 268: 0 [2025-11-13 09:41:20,286][__main__][INFO] - agents played in iteration 268 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:41:20,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:20,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:20,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:20,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:20,846][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:20,846][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:21,882][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:22,209][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:22,537][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:22,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:24,489][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:24,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:25,139][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:25,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:26,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:26,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:26,765][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:27,417][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:27,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:41:28,069][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:41:28,395][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:28,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:29,373][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:29,700][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:30,028][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:30,681][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:31,006][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:31,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:31,661][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:31,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:41:32,723][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:33,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:33,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:33,459][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:34,444][__main__][INFO] - Iteration 269 took 22s (37.42% Gen, 58.23% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 13m 25s. Estimated total time: 18h 51m 10s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 42s, 500 more iterations: 3h 8m 31s. [2025-11-13 09:41:34,446][__main__][INFO] - Starting iteration 269. [2025-11-13 09:41:34,449][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:41:34,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:41:43,163][__main__][INFO] - Number of regex retries in iteration 269: 0
[2025-11-13 09:41:43,164][__main__][INFO] - agents played in iteration 269 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:41:43,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:43,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:43,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:43,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:43,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:41:43,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:41:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:41:44,757][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:41:45,083][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:41:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:41:45,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:41:46,067][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:41:46,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:41:46,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:41:47,042][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:41:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:41:47,692][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:41:48,017][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:41:48,342][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:41:48,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:41:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:41:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:41:49,645][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:41:49,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:41:50,296][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:41:50,621][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:41:50,946][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:41:51,270][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:41:51,595][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:41:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:41:52,250][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:41:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:41:52,899][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:41:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:41:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:41:53,874][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:41:54,199][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:41:54,525][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:41:54,850][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:41:55,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:41:56,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:41:56,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:41:56,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:41:57,270][__main__][INFO] - Iteration 270 took 22s (38.18% Gen, 57.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 22m 56s. Estimated total time: 19h 1m 4s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 2s, 500 more iterations: 3h 10m 10s.
[2025-11-13 09:41:57,272][__main__][INFO] - Starting iteration 270.
[2025-11-13 09:41:57,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:41:57,275][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:42:05,891][__main__][INFO] - Number of regex retries in iteration 270: 0
[2025-11-13 09:42:05,892][__main__][INFO] - agents played in iteration 270 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:42:06,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:06,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:06,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:06,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:06,427][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:42:06,427][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:42:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:42:07,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:42:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:42:08,113][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:42:08,438][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:42:08,770][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:42:09,096][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:42:09,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:42:09,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:42:10,073][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:42:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:10,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:12,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:12,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:13,980][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:14,305][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:14,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:14,960][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:15,608][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:15,934][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:16,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:16,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:17,241][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:17,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:18,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:19,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:19,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:19,018][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:20,901][__main__][INFO] - Iteration 271 took 23s (36.47% Gen, 55.56% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 2m 49s. Estimated total time: 19h 41m 20s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 53s.
[2025-11-13 09:42:20,903][__main__][INFO] - Starting iteration 271.
[2025-11-13 09:42:20,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:42:20,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:42:30,030][__main__][INFO] - Number of regex retries in iteration 271: 0
[2025-11-13 09:42:30,031][__main__][INFO] - agents played in iteration 271 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:42:30,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:30,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:30,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:30,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:30,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:42:30,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:42:31,331][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:42:31,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:42:31,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:42:32,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:42:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:42:32,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:42:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:42:33,583][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:42:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:42:34,233][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:42:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:34,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:35,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:35,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:35,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:36,184][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:36,509][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:36,834][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:37,160][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:37,485][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:37,813][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:39,120][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:39,449][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:39,779][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:40,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:40,759][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:41,086][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:41,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:42,440][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:43,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:43,167][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:43,169][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:44,146][__main__][INFO] - Iteration 272 took 23s (39.26% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 8s. Estimated total time: 19h 22m 3s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 40s.
[2025-11-13 09:42:44,148][__main__][INFO] - Starting iteration 272.
[2025-11-13 09:42:44,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:42:44,152][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:42:53,089][__main__][INFO] - Number of regex retries in iteration 272: 0
[2025-11-13 09:42:53,090][__main__][INFO] - agents played in iteration 272 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:42:53,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:53,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:53,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:53,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:53,635][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:42:53,636][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:42:54,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:42:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:42:54,991][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:42:55,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:42:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:42:55,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:42:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:42:56,621][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:42:56,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:42:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:42:57,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:57,923][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:58,247][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:58,574][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:58,898][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:59,223][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:59,548][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:59,873][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:00,198][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:00,524][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:00,849][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:43:01,175][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:43:01,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:43:01,825][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:43:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:43:02,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:43:02,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:43:03,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:43:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:43:03,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:43:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:43:04,429][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:43:04,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:05,432][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:43:06,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:43:06,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:43:06,168][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:43:07,153][__main__][INFO] - Iteration 273 took 23s (38.85% Gen, 56.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 30m 49s. Estimated total time: 19h 10m 7s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 41s.
[2025-11-13 09:43:07,155][__main__][INFO] - Starting iteration 273.
[2025-11-13 09:43:07,159][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:43:07,159][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:15,832][__main__][INFO] - Number of regex retries in iteration 273: 0
[2025-11-13 09:43:15,833][__main__][INFO] - agents played in iteration 273 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:43:16,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:16,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:16,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:16,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:16,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:16,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:17,093][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:17,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:17,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:18,041][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:18,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:19,020][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:19,346][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:19,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:19,996][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:20,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:20,652][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:20,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:21,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:21,627][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:21,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:22,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:22,929][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:23,253][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:23,579][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:43:23,902][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:43:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:43:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:43:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:43:25,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:43:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:43:25,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:43:26,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:43:26,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:43:26,849][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:43:27,179][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:43:27,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:28,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:43:28,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:43:28,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:43:28,943][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:43:29,898][__main__][INFO] - Iteration 274 took 22s (38.14% Gen, 57.65% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 17m 18s. Estimated total time: 18h 56m 59s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 29s.
[2025-11-13 09:43:29,900][__main__][INFO] - Starting iteration 274.
[2025-11-13 09:43:29,903][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:43:29,904][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:38,578][__main__][INFO] - Number of regex retries in iteration 274: 0
[2025-11-13 09:43:38,579][__main__][INFO] - agents played in iteration 274 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:43:39,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:39,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:39,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:39,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:39,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:39,119][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:40,239][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:40,892][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:41,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:41,544][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:41,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:42,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:42,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:42,851][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:43,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:43,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:44,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:44,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:44,808][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:45,134][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:45,458][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:45,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:46,435][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:43:46,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:43:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:43:47,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:43:47,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:43:48,060][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:43:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:43:48,710][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:43:49,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:43:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:43:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:43:50,009][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:43:50,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:51,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:43:51,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:43:51,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:43:51,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:43:52,717][__main__][INFO] - Iteration 275 took 22s (38.02% Gen, 57.78% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 20m 40s. Estimated total time: 19h 0m 43s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 1s, 500 more iterations: 3h 10m 7s.
[2025-11-13 09:43:52,719][__main__][INFO] - Starting iteration 275.
[2025-11-13 09:43:52,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:43:52,723][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:44:01,740][__main__][INFO] - Number of regex retries in iteration 275: 0
[2025-11-13 09:44:01,741][__main__][INFO] - agents played in iteration 275 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:44:02,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:02,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:02,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:02,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:02,289][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:44:02,290][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:44:02,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:44:03,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:44:03,600][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:44:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:44:04,252][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:44:04,578][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:44:04,904][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:44:05,228][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:44:05,553][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:44:05,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:44:06,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:44:06,530][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:44:06,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:44:07,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:44:07,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:44:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:44:08,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:44:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:44:08,809][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:44:09,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:44:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:44:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:44:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:44:10,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:44:10,762][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:44:11,087][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:44:11,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:44:11,739][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:44:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:44:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:44:12,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:44:13,053][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:44:13,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:14,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:44:14,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:44:14,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:44:14,822][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:44:15,778][__main__][INFO] - Iteration 276 took 23s (39.11% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 22s. Estimated total time: 19h 12m 48s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 8s.
[2025-11-13 09:44:15,780][__main__][INFO] - Starting iteration 276.
[2025-11-13 09:44:15,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:44:15,784][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:44:24,298][__main__][INFO] - Number of regex retries in iteration 276: 0
[2025-11-13 09:44:24,299][__main__][INFO] - agents played in iteration 276 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:44:24,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:24,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:24,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:24,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:24,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:44:24,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:44:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:44:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:44:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:44:26,499][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:44:26,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:44:27,151][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:44:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:44:27,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:44:28,131][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:44:28,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:44:28,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:44:29,110][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:44:29,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:44:29,765][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:44:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:44:30,417][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:44:30,742][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:44:31,067][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:44:31,392][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:44:31,718][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:44:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:44:32,368][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:44:32,694][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:44:33,019][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:44:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:44:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:44:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:44:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:44:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:44:34,969][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:44:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:44:35,621][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:44:35,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:36,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:44:37,363][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:44:37,365][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:44:37,366][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:44:38,315][__main__][INFO] - Iteration 277 took 22s (37.78% Gen, 57.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 5m 48s. Estimated total time: 18h 46m 37s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 33s, 500 more iterations: 3h 7m 46s.
[2025-11-13 09:44:38,317][__main__][INFO] - Starting iteration 277.
[2025-11-13 09:44:38,321][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:44:38,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:44:47,240][__main__][INFO] - Number of regex retries in iteration 277: 0
[2025-11-13 09:44:47,241][__main__][INFO] - agents played in iteration 277 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:44:47,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:47,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:47,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:47,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:47,783][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:44:47,783][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:44:48,503][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:44:48,800][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:44:49,126][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:44:49,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:44:49,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:44:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:44:50,426][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:44:50,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:44:51,077][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:44:51,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:44:51,727][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:44:52,054][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:44:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:44:52,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:44:53,038][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:44:53,365][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:44:53,691][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:44:54,017][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:44:54,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:44:54,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:44:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:44:55,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:44:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:44:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:44:56,293][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:44:56,619][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:44:56,945][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:44:57,269][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:44:57,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:44:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:44:58,246][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:44:58,570][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:44:58,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:59,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:45:00,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:45:00,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:45:00,341][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:45:01,311][__main__][INFO] - Iteration 278 took 22s (38.79% Gen, 56.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 19s. Estimated total time: 19h 9m 31s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 35s.
[2025-11-13 09:45:01,313][__main__][INFO] - Starting iteration 278.
[2025-11-13 09:45:01,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:45:01,316][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:45:10,013][__main__][INFO] - Number of regex retries in iteration 278: 0
[2025-11-13 09:45:10,014][__main__][INFO] - agents played in iteration 278 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:45:10,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:10,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:10,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:10,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:10,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:45:10,567][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:45:11,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:45:11,586][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:45:11,912][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:45:12,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:45:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:45:12,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:45:13,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:45:13,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:45:13,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:45:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:45:14,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:45:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:45:15,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:45:15,501][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:45:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:45:16,158][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:45:16,489][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:45:16,816][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:45:17,143][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:45:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:45:17,797][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:45:18,122][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:45:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:45:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:45:19,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:45:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:45:19,747][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:45:20,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:45:20,397][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:45:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:45:21,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:45:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:45:21,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:45:22,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:45:23,121][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:45:23,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:45:23,124][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:45:24,232][__main__][INFO] - Iteration 279 took 22s (37.95% Gen, 57.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 24m 16s. Estimated total time: 19h 5m 51s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 58s.
[2025-11-13 09:45:24,235][__main__][INFO] - Starting iteration 279.
[2025-11-13 09:45:24,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:45:24,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:45:32,912][__main__][INFO] - Number of regex retries in iteration 279: 0
[2025-11-13 09:45:32,913][__main__][INFO] - agents played in iteration 279 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:45:33,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:33,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:33,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:33,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:33,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:45:33,447][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:45:34,165][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:45:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:45:34,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:45:35,111][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:45:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:45:35,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:45:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:45:36,413][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:45:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:45:37,067][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:45:37,393][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:45:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:45:38,045][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:45:38,371][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:45:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:45:39,023][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:45:39,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:45:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:45:40,007][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:45:40,334][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:45:40,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:45:40,988][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:45:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:45:41,638][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:45:41,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:45:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:45:42,615][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:45:42,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:45:43,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:45:43,589][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:45:43,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:45:44,239][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:45:44,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:45:45,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:45:45,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:45:45,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:45:45,998][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:45:46,996][__main__][INFO] - Iteration 280 took 22s (38.11% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 1s. Estimated total time: 18h 57m 59s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 55s, 500 more iterations: 3h 9m 39s.
[2025-11-13 09:45:46,998][__main__][INFO] - Starting iteration 280.
[2025-11-13 09:45:47,002][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:45:47,002][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:45:55,747][__main__][INFO] - Number of regex retries in iteration 280: 0
[2025-11-13 09:45:55,748][__main__][INFO] - agents played in iteration 280 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:45:56,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:56,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:56,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:56,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:56,292][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:45:56,293][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:45:57,012][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:45:57,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:45:57,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:45:57,960][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:45:58,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:45:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:45:58,939][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:45:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:45:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:45:59,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:00,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:00,571][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:01,222][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:01,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:01,882][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:02,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:02,867][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:03,192][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:03,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:03,845][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:04,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:04,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:04,821][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:05,146][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:05,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:06,124][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:06,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:06,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:07,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:08,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:08,843][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:08,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:08,847][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:10,694][__main__][INFO] - Iteration 281 took 23s (36.91% Gen, 55.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 2m 19s. Estimated total time: 19h 44m 40s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 26s.
[2025-11-13 09:46:10,696][__main__][INFO] - Starting iteration 281.
[2025-11-13 09:46:10,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:46:10,700][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:46:19,930][__main__][INFO] - Number of regex retries in iteration 281: 0
[2025-11-13 09:46:19,931][__main__][INFO] - agents played in iteration 281 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:46:20,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:20,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:20,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:20,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:20,459][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:46:20,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:46:21,156][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:46:21,451][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:46:21,776][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:46:22,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:46:22,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:46:22,752][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:46:23,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:46:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:46:23,728][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:46:24,053][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:24,381][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:25,033][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:25,359][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:25,686][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:26,013][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:26,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:26,997][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:27,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:27,651][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:27,980][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:28,633][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:28,959][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:29,283][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:29,609][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:29,935][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:30,260][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:30,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:30,911][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:31,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:32,289][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:33,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:33,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:33,007][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:33,886][__main__][INFO] - Iteration 282 took 23s (39.81% Gen, 56.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 37s. Estimated total time: 19h 19m 22s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 13s.
[2025-11-13 09:46:33,889][__main__][INFO] - Starting iteration 282.
[2025-11-13 09:46:33,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:46:33,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:46:43,411][__main__][INFO] - Number of regex retries in iteration 282: 0
[2025-11-13 09:46:43,411][__main__][INFO] - agents played in iteration 282 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:46:43,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:43,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:43,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:43,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:43,952][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:46:43,952][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:46:44,664][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:46:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:46:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:46:45,611][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:46:45,936][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:46:46,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:46:46,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:46:46,911][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:46:47,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:46:47,567][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:47,893][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:48,219][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:48,545][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:48,872][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:49,202][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:49,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:49,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:50,186][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:50,512][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:50,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:51,823][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:52,478][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:52,803][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:53,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:53,453][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:53,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:54,105][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:54,755][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:55,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:55,802][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:56,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:56,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:56,515][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:57,382][__main__][INFO] - Iteration 283 took 23s (40.52% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 51m 27s. Estimated total time: 19h 34m 35s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 45s.
[2025-11-13 09:46:57,384][__main__][INFO] - Starting iteration 283.
[2025-11-13 09:46:57,387][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:46:57,388][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:06,146][__main__][INFO] - Number of regex retries in iteration 283: 0
[2025-11-13 09:47:06,147][__main__][INFO] - agents played in iteration 283 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:47:06,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:06,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:06,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:06,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:06,686][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:06,686][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:07,688][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:08,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:08,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:09,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:10,936][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:11,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:12,891][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:13,869][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:14,195][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:14,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:15,177][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:15,837][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:16,162][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:16,488][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:16,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:17,141][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:17,467][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:17,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:18,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:19,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:19,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:19,217][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:20,086][__main__][INFO] - Iteration 284 took 22s (38.59% Gen, 57.57% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 11m 28s. Estimated total time: 18h 54m 59s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 9s.
[2025-11-13 09:47:20,088][__main__][INFO] - Starting iteration 284.
[2025-11-13 09:47:20,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:20,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:29,665][__main__][INFO] - Number of regex retries in iteration 284: 0
[2025-11-13 09:47:29,665][__main__][INFO] - agents played in iteration 284 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:47:30,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:30,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:30,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:30,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:30,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:30,212][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:30,905][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:31,201][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:31,526][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:31,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:32,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:32,505][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:32,835][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:33,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:33,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:34,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:34,462][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:34,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:35,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:35,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:37,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:37,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:38,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:38,719][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:39,046][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:39,699][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:40,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:40,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:41,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:42,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:42,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:42,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:42,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:43,675][__main__][INFO] - Iteration 285 took 23s (40.59% Gen, 55.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 19s. Estimated total time: 19h 39m 14s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 32s.
[2025-11-13 09:47:43,677][__main__][INFO] - Starting iteration 285.
[2025-11-13 09:47:43,680][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:43,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:53,099][__main__][INFO] - Number of regex retries in iteration 285: 0
[2025-11-13 09:47:53,099][__main__][INFO] - agents played in iteration 285 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:47:53,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:53,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:53,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:53,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:53,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:53,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:54,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:54,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:54,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:55,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:55,605][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:55,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:56,907][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:57,233][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:57,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:58,220][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:59,521][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:59,846][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:00,172][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:00,498][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:00,824][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:01,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:01,804][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:02,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:02,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:02,782][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:03,108][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:03,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:04,088][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:04,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:05,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:06,167][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:06,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:06,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:07,055][__main__][INFO] - Iteration 286 took 23s (40.29% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 29s. Estimated total time: 19h 28m 47s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 47s.
[2025-11-13 09:48:07,056][__main__][INFO] - Starting iteration 286.
[2025-11-13 09:48:07,059][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:07,060][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:16,413][__main__][INFO] - Number of regex retries in iteration 286: 0
[2025-11-13 09:48:16,414][__main__][INFO] - agents played in iteration 286 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:48:16,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:16,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:16,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:16,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:16,943][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:16,944][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:17,940][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:18,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:19,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:19,570][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:19,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:20,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:20,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:21,195][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:21,520][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:22,492][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:23,143][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:23,468][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:23,793][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:24,119][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:24,444][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:24,772][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:25,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:25,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:26,084][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:26,412][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:26,737][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:27,063][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:27,389][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:27,714][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:28,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:28,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:29,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:29,454][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:29,455][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:30,365][__main__][INFO] - Iteration 287 took 23s (40.13% Gen, 55.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 37s. Estimated total time: 19h 25m 19s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 13s.
[2025-11-13 09:48:30,367][__main__][INFO] - Starting iteration 287.
[2025-11-13 09:48:30,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:30,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:39,865][__main__][INFO] - Number of regex retries in iteration 287: 0
[2025-11-13 09:48:39,866][__main__][INFO] - agents played in iteration 287 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:48:40,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:40,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:40,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:40,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:40,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:40,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:41,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:41,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:41,707][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:42,361][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:42,687][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:43,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:43,997][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:44,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:44,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:44,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:45,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:45,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:45,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:46,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:46,605][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:47,254][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:47,908][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:48,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:48,893][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:49,218][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:49,547][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:50,530][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:50,856][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:51,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:52,226][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:52,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:52,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:52,925][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:53,855][__main__][INFO] - Iteration 288 took 23s (40.43% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 14s. Estimated total time: 19h 34m 18s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 43s.
[2025-11-13 09:48:53,857][__main__][INFO] - Starting iteration 288.
[2025-11-13 09:48:53,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:53,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:03,705][__main__][INFO] - Number of regex retries in iteration 288: 0
[2025-11-13 09:49:03,706][__main__][INFO] - agents played in iteration 288 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:49:04,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:04,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:04,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:04,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:04,255][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:04,256][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:04,951][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:05,246][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:05,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:05,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:06,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:06,885][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:07,210][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:07,537][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:08,191][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:08,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:08,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:09,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:09,815][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:10,139][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:10,464][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:10,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:11,115][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:11,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:12,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:12,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:13,092][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:13,420][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:13,748][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:14,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:14,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:15,052][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:15,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:16,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:16,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:16,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:16,812][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:17,738][__main__][INFO] - Iteration 289 took 23s (41.23% Gen, 54.89% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 30s. Estimated total time: 19h 53m 59s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 59s.
[2025-11-13 09:49:17,740][__main__][INFO] - Starting iteration 289.
[2025-11-13 09:49:17,743][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:17,743][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:27,183][__main__][INFO] - Number of regex retries in iteration 289: 0
[2025-11-13 09:49:27,184][__main__][INFO] - agents played in iteration 289 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:49:27,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:27,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:27,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:27,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:27,727][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:27,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:29,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:30,340][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:30,991][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:31,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:31,970][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:32,301][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:32,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:32,955][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:33,281][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:33,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:34,582][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:35,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:35,566][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:35,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:36,220][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:36,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:38,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:38,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:39,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:40,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:40,321][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:40,324][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:41,204][__main__][INFO] - Iteration 290 took 23s (40.24% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 15s. Estimated total time: 19h 33m 7s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 31s.
[2025-11-13 09:49:41,207][__main__][INFO] - Starting iteration 290.
[2025-11-13 09:49:41,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:41,211][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:51,191][__main__][INFO] - Number of regex retries in iteration 290: 0
[2025-11-13 09:49:51,192][__main__][INFO] - agents played in iteration 290 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:49:51,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:51,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:51,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:51,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:51,727][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:51,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:52,731][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:53,057][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:53,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:54,033][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:54,358][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:54,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:55,009][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:55,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:55,985][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:56,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:56,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:57,288][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:57,615][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:58,267][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:58,919][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:59,244][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:59,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:59,900][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:00,553][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:00,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:01,535][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:01,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:02,192][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:02,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:03,559][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:04,269][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:04,271][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:04,273][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:06,083][__main__][INFO] - Iteration 291 took 24s (40.13% Gen, 52.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 57m 23s. Estimated total time: 20h 43m 40s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 27s, 500 more iterations: 3h 27m 16s.
[2025-11-13 09:50:06,085][__main__][INFO] - Starting iteration 291.
[2025-11-13 09:50:06,088][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:50:06,088][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:50:16,119][__main__][INFO] - Number of regex retries in iteration 291: 0 [2025-11-13 09:50:16,120][__main__][INFO] - agents played in iteration 291 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:50:16,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:16,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:16,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:16,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:16,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:50:16,664][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:50:17,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:18,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:18,626][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:18,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:19,282][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:19,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:19,931][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:20,255][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:20,581][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:20,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:21,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:22,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:22,532][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:22,860][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:23,840][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:24,166][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:24,490][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:24,817][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:25,469][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:25,796][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:26,123][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:26,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:27,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:27,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:28,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:50:29,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:50:29,199][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:50:29,201][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:50:30,105][__main__][INFO] - Iteration 292 took 24s (41.77% Gen, 54.46% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 14m 14s. Estimated total time: 20h 0m 55s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 9s. [2025-11-13 09:50:30,107][__main__][INFO] - Starting iteration 292. [2025-11-13 09:50:30,110][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:50:30,110][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:50:40,121][__main__][INFO] - Number of regex retries in iteration 292: 0 [2025-11-13 09:50:40,121][__main__][INFO] - agents played in iteration 292 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:50:40,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:40,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:40,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:40,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:40,650][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:50:40,651][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:50:41,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:41,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:41,990][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:42,318][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:42,644][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:43,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:43,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:43,953][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:44,279][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:44,930][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:45,258][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:45,911][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:46,562][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:47,213][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:47,863][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:48,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:48,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:48,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:49,176][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:49,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:49,828][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:50,154][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:50,480][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:50,805][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:51,130][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:51,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:52,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:50:53,192][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:50:53,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:50:53,196][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:50:54,068][__main__][INFO] - Iteration 293 took 23s (41.78% Gen, 54.57% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 10m 52s. Estimated total time: 19h 57m 57s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 55s, 500 more iterations: 3h 19m 39s. [2025-11-13 09:50:54,070][__main__][INFO] - Starting iteration 293. [2025-11-13 09:50:54,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:50:54,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:51:03,682][__main__][INFO] - Number of regex retries in iteration 293: 0 [2025-11-13 09:51:03,683][__main__][INFO] - agents played in iteration 293 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:51:04,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:04,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:04,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:04,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:04,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:51:04,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:51:04,970][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:05,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:05,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:06,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:06,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:07,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:07,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:08,198][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:08,523][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:08,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:09,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:10,150][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:10,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:10,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:11,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:11,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:11,788][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:12,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:12,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:12,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:13,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:13,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:14,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:15,054][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:15,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:16,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:51:16,810][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:51:16,811][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:51:16,813][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:51:17,709][__main__][INFO] - Iteration 294 took 23s (40.65% Gen, 55.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 23s. Estimated total time: 19h 41m 51s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 58s. [2025-11-13 09:51:17,711][__main__][INFO] - Starting iteration 294. [2025-11-13 09:51:17,713][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:51:17,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:51:28,144][__main__][INFO] - Number of regex retries in iteration 294: 0 [2025-11-13 09:51:28,145][__main__][INFO] - agents played in iteration 294 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:51:28,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:28,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:28,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:28,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:28,682][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:51:28,682][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:51:29,412][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:30,365][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:30,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:31,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:31,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:31,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:31,986][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:32,312][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:32,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:32,961][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:33,286][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:33,610][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:33,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:34,258][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:34,584][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:34,911][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:35,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:35,562][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:36,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:36,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:37,200][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:37,528][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:37,853][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:38,178][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:38,504][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:38,829][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:39,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:39,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:40,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:51:41,236][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:51:41,237][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:51:41,239][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:51:42,135][__main__][INFO] - Iteration 295 took 24s (42.71% Gen, 53.62% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 33m 13s. Estimated total time: 20h 21m 6s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 42s, 500 more iterations: 3h 23m 31s. [2025-11-13 09:51:42,137][__main__][INFO] - Starting iteration 295. [2025-11-13 09:51:42,140][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:51:42,141][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:51:51,510][__main__][INFO] - Number of regex retries in iteration 295: 0 [2025-11-13 09:51:51,511][__main__][INFO] - agents played in iteration 295 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:51:51,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:52,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:52,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:52,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:52,382][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:51:52,383][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:51:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:53,361][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:53,686][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:54,011][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:54,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:54,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:54,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:55,307][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:55,632][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:55,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:56,282][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:57,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:57,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:57,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:58,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:58,561][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:58,886][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:59,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:59,536][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:59,861][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:52:00,188][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:00,515][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:00,847][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:01,172][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:01,498][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:02,477][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:02,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:03,128][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:03,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:52:04,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:52:04,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:52:04,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:52:04,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:52:05,793][__main__][INFO] - Iteration 296 took 23s (39.61% Gen, 56.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 23s. Estimated total time: 19h 42m 40s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 6s. [2025-11-13 09:52:05,795][__main__][INFO] - Starting iteration 296. [2025-11-13 09:52:05,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:52:05,798][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:52:15,927][__main__][INFO] - Number of regex retries in iteration 296: 0 [2025-11-13 09:52:15,928][__main__][INFO] - agents played in iteration 296 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:52:16,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:16,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:16,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:16,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:16,453][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:52:16,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:52:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:52:17,427][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:52:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:52:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:52:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:52:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:52:19,056][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:52:19,380][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:52:19,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:52:20,028][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:52:20,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:52:20,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:52:20,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:52:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:52:21,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:52:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:52:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:52:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:52:22,953][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:52:23,280][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:52:23,605][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:52:23,931][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:52:24,256][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:52:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:52:24,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:52:25,233][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:52:25,557][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:52:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:52:26,207][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:52:26,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:52:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:52:27,185][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:52:27,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
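The run above walks 128 mini-batches, logs every fourth one, and only then reports "Accumulated the policy gradient loss for 3840 tokens" before a single reinforce step: a standard gradient-accumulation pattern. A sketch of the accounting, assuming each mini-batch contributes a summed per-token loss (in the real trainer each mini-batch would also call `backward()`, accumulating gradients into the adapter parameters):

```python
def accumulate_policy_loss(minibatches, log_every=4):
    """Token-weighted loss accumulation over mini-batches (sketch).

    `minibatches` is a list of (loss_sum, n_tokens) pairs, where loss_sum
    is the summed per-token policy-gradient loss for that mini-batch.
    One optimizer step follows only after every mini-batch is processed.
    """
    total_loss, total_tokens = 0.0, 0
    for i, (loss_sum, n_tokens) in enumerate(minibatches):
        if i % log_every == 0:
            print(f"Processing mini-batch {i} of {len(minibatches)}")
        total_loss += loss_sum
        total_tokens += n_tokens
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss / max(total_tokens, 1)
```

With 128 mini-batches of 30 action tokens each this reproduces the 3840-token total seen in the log (128 × 30; the 30-token figure is illustrative, not taken from the log).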
[2025-11-13 09:52:28,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:52:28,930][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:52:28,932][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:52:28,934][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:52:29,821][__main__][INFO] - Iteration 297 took 24s (42.16% Gen, 54.14% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 12m 32s. Estimated total time: 20h 1m 13s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 2s, 500 more iterations: 3h 20m 12s. [2025-11-13 09:52:29,823][__main__][INFO] - Starting iteration 297. [2025-11-13 09:52:29,825][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
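The per-iteration summary lines ("Iteration 297 took 24s … Estimated remaining time … 100 more iterations: …") can be produced by a simple extrapolation from iteration wall time. A naive sketch follows; the real estimator likely smooths over many iterations rather than using only the latest one, and the function and field names here are hypothetical:

```python
def fmt_hms(seconds):
    """Format seconds as 'Hh Mm Ss', dropping leading zero fields."""
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    if h:
        return f"{h}h {m}m {sec}s"
    if m:
        return f"{m}m {sec}s"
    return f"{sec}s"

def eta_report(done, total_iters, iter_seconds):
    """Extrapolate remaining/total time from one iteration's duration.

    (Sketch: a single-sample estimate; a production estimator would
    average recent iterations to damp the jitter visible in the log,
    where the ETA swings by tens of minutes between iterations.)
    """
    remaining = total_iters - done
    return {
        "remaining": fmt_hms(remaining * iter_seconds),
        "total": fmt_hms(total_iters * iter_seconds),
        "next_100": fmt_hms(100 * iter_seconds),
    }
```

For example, 3000 iterations left at 24 s each extrapolates to 20 hours, which is the right order of magnitude for the estimates logged here (the iteration counts are illustrative).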
[2025-11-13 09:52:29,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:52:39,247][__main__][INFO] - Number of regex retries in iteration 297: 0 [2025-11-13 09:52:39,248][__main__][INFO] - agents played in iteration 297 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:52:39,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:40,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:40,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:40,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:40,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:52:40,126][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
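The "buffer policies" count in the log grows over time (29 around iteration 297, 30 by iteration 301), which is consistent with a self-play opponent pool that periodically snapshots the live policy. The class below is a guess at that mechanism, not the trainer's actual implementation; every name in it is hypothetical:

```python
import random

class OpponentBuffer:
    """Minimal self-play opponent pool (sketch; names hypothetical).

    Every `snapshot_every` iterations a frozen copy of the current policy
    state is appended. Opponents like 'Bob_buffer' / 'Alice_buffer' in the
    log would then be sampled from this pool alongside the live policies.
    """

    def __init__(self, snapshot_every=10):
        self.snapshot_every = snapshot_every
        self.buffer = []

    def maybe_snapshot(self, iteration, policy_state):
        # Freeze a copy so later training steps cannot mutate the snapshot.
        if iteration % self.snapshot_every == 0:
            self.buffer.append(dict(policy_state))

    def sample(self, rng=random):
        # Uniform sampling over stored snapshots; None until the first one.
        return rng.choice(self.buffer) if self.buffer else None
```

Under a snapshot-every-10-iterations assumption, 300 iterations yield 30 stored opponents, matching the logged count's order of magnitude.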
[2025-11-13 09:52:40,817][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:52:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:52:41,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:52:41,762][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:52:42,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:52:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:52:42,733][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:52:43,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:52:43,381][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:52:43,705][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:52:44,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:52:44,352][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:52:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:52:45,001][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:52:45,325][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:52:45,651][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:52:45,975][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:52:46,301][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:52:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:52:46,951][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:52:47,278][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:52:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:52:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:52:48,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:52:48,587][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:52:48,914][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:52:49,240][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:52:49,565][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:52:49,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:52:50,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:52:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:52:50,867][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:52:51,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:52:51,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:52:52,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:52:52,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:52:52,611][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:52:53,504][__main__][INFO] - Iteration 298 took 23s (39.79% Gen, 56.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 54s. Estimated total time: 19h 43m 59s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 19s. [2025-11-13 09:52:53,506][__main__][INFO] - Starting iteration 298. [2025-11-13 09:52:53,509][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:52:53,509][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:02,999][__main__][INFO] - Number of regex retries in iteration 298: 0 [2025-11-13 09:53:02,999][__main__][INFO] - agents played in iteration 298 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:53:03,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:03,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:03,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:03,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:03,543][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:03,544][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:53:04,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:53:04,529][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:53:04,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:53:05,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:53:05,506][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:53:05,831][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:53:06,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:53:06,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:53:06,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:53:07,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:53:07,456][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:53:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:53:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:53:08,427][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:53:08,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:53:09,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:53:09,401][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:53:09,728][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:53:10,058][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:53:10,385][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:53:10,709][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:53:11,037][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:11,365][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:11,697][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:12,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:12,675][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:13,651][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:13,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:53:14,302][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:53:14,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:53:15,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:16,049][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:16,050][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:16,052][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:16,932][__main__][INFO] - Iteration 299 took 23s (40.51% Gen, 55.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 43s. Estimated total time: 19h 31m 11s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 11s. [2025-11-13 09:53:16,934][__main__][INFO] - Starting iteration 299. [2025-11-13 09:53:16,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:53:16,937][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:26,004][__main__][INFO] - Number of regex retries in iteration 299: 0 [2025-11-13 09:53:26,004][__main__][INFO] - agents played in iteration 299 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:53:26,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:26,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:26,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:26,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:26,540][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:26,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:53:27,265][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:53:27,561][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:53:27,886][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:53:28,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:53:28,537][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:53:28,862][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:53:29,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:53:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:53:29,838][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:53:30,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:53:30,490][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:53:30,817][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:53:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:53:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:53:31,792][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:53:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:53:32,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:53:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:53:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:53:33,426][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:53:33,758][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:53:34,084][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:34,414][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:34,740][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:35,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:35,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:36,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:36,695][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:37,020][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:53:37,344][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:53:37,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:53:38,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:39,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:39,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:39,081][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:39,965][__main__][INFO] - Iteration 300 took 23s (39.37% Gen, 56.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 35s. Estimated total time: 19h 11m 26s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 54s. [2025-11-13 09:53:39,967][__main__][INFO] - Starting iteration 300. [2025-11-13 09:53:39,970][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:53:39,970][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:48,915][__main__][INFO] - Number of regex retries in iteration 300: 0 [2025-11-13 09:53:48,916][__main__][INFO] - agents played in iteration 300 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:53:49,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:49,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:49,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:49,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:49,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:49,449][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:53:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:53:50,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:53:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:53:51,099][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:53:51,426][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:53:51,751][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:53:52,077][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:53:52,403][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:53:52,727][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:53:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:53:53,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:53:53,701][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:53:54,026][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:53:54,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:53:54,676][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:53:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:53:55,326][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:53:55,650][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:53:55,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:53:56,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:53:56,624][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:53:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:57,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:57,605][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:58,255][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:58,585][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:58,912][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:59,240][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:59,567][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:59,892][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:54:00,216][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:54:00,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:54:01,261][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:54:01,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:54:01,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:54:01,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:54:03,680][__main__][INFO] - Iteration 301 took 23s (37.72% Gen, 55.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 55m 20s. Estimated total time: 19h 45m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 35s. [2025-11-13 09:54:03,683][__main__][INFO] - Starting iteration 301. [2025-11-13 09:54:03,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 09:54:03,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:54:13,789][__main__][INFO] - Number of regex retries in iteration 301: 0 [2025-11-13 09:54:13,790][__main__][INFO] - agents played in iteration 301 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 09:54:14,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:14,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:14,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:14,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:14,332][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:54:14,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:54:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:54:15,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:54:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:54:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:54:16,346][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:54:16,672][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:54:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:54:17,325][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:54:17,655][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:54:17,979][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:54:18,304][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:54:18,629][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:54:18,959][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:54:19,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:54:19,614][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:54:19,937][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:54:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:54:20,592][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:54:20,916][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:54:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:54:21,569][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:54:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:54:22,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:54:22,553][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:54:22,880][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:54:23,208][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:54:23,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:54:23,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:54:24,185][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:54:24,510][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:54:24,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:54:25,160][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:54:25,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:54:26,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:54:26,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:54:26,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:54:26,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:54:27,945][__main__][INFO] - Iteration 302 took 24s (41.65% Gen, 54.31% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 22m 21s. Estimated total time: 20h 13m 0s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 26s, 500 more iterations: 3h 22m 10s. [2025-11-13 09:54:27,947][__main__][INFO] - Starting iteration 302. [2025-11-13 09:54:27,950][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 09:54:27,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:54:37,943][__main__][INFO] - Number of regex retries in iteration 302: 0
[2025-11-13 09:54:37,943][__main__][INFO] - agents played in iteration 302 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:54:38,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:38,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:54:38,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:54:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:54:39,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:54:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:54:40,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:54:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:54:40,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:54:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:54:41,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:54:41,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:54:42,084][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:54:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:54:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:54:43,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:54:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:54:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:54:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:54:44,353][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:54:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:54:45,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:54:45,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:54:45,651][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:54:45,976][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:54:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:54:46,632][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:54:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:54:47,285][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:54:47,612][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:54:47,942][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:54:48,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:54:48,596][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:54:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:54:49,248][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:54:49,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:54:50,287][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:54:50,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:54:50,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:54:50,990][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:54:51,926][__main__][INFO] - Iteration 303 took 23s (41.68% Gen, 54.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 7m 48s. Estimated total time: 19h 58m 50s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 48s.
[2025-11-13 09:54:51,928][__main__][INFO] - Starting iteration 303.
[2025-11-13 09:54:51,930][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:54:51,931][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:02,043][__main__][INFO] - Number of regex retries in iteration 303: 0
[2025-11-13 09:55:02,043][__main__][INFO] - agents played in iteration 303 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:55:02,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:02,574][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:02,575][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:03,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:04,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:04,550][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:04,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:05,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:05,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:05,846][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:06,171][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:06,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:06,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:07,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:07,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:07,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:08,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:08,446][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:08,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:09,095][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:09,422][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:09,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:10,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:10,396][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:10,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:11,048][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:12,025][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:13,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:14,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:15,113][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:15,114][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:15,116][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:16,022][__main__][INFO] - Iteration 304 took 24s (41.97% Gen, 54.26% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 13m 12s. Estimated total time: 20h 4m 38s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 46s.
[2025-11-13 09:55:16,025][__main__][INFO] - Starting iteration 304.
[2025-11-13 09:55:16,027][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:16,028][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:25,717][__main__][INFO] - Number of regex retries in iteration 304: 0
[2025-11-13 09:55:25,717][__main__][INFO] - agents played in iteration 304 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:55:26,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:26,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:26,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:26,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:26,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:26,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:26,941][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:27,237][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:27,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:28,873][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:29,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:29,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:29,850][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:30,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:31,157][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:31,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:32,141][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:32,473][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:33,128][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:34,107][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:34,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:35,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:35,411][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:35,737][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:36,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:36,388][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:37,038][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:37,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:38,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:38,776][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:38,778][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:38,779][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:39,680][__main__][INFO] - Iteration 305 took 23s (40.96% Gen, 55.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 50s. Estimated total time: 19h 42m 41s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 6s.
[2025-11-13 09:55:39,682][__main__][INFO] - Starting iteration 305.
[2025-11-13 09:55:39,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:39,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:49,125][__main__][INFO] - Number of regex retries in iteration 305: 0
[2025-11-13 09:55:49,125][__main__][INFO] - agents played in iteration 305 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:55:49,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:49,667][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:49,668][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:50,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:50,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:50,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:51,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:51,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:51,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:52,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:52,611][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:52,937][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:53,588][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:53,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:54,248][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:54,576][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:56,210][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:56,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:57,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:58,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:58,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:59,476][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:59,801][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:00,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:00,781][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:01,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:02,209][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:02,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:02,213][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:03,091][__main__][INFO] - Iteration 306 took 23s (40.33% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 7s. Estimated total time: 19h 30m 21s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 3s.
[2025-11-13 09:56:03,093][__main__][INFO] - Starting iteration 306.
[2025-11-13 09:56:03,096][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:03,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:13,217][__main__][INFO] - Number of regex retries in iteration 306: 0
[2025-11-13 09:56:13,217][__main__][INFO] - agents played in iteration 306 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:56:13,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:13,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:13,754][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:14,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:15,066][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:15,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:16,378][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:16,705][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:17,031][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:17,688][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:18,017][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:18,344][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:18,673][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:19,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:19,661][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:19,987][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:20,313][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:20,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:20,970][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:21,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:21,624][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:21,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:22,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:22,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:22,929][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:23,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:23,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:23,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:24,240][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:24,566][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:24,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:25,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:26,313][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:26,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:26,316][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:27,282][__main__][INFO] - Iteration 307 took 24s (41.85% Gen, 54.16% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 16m 43s. Estimated total time: 20h 9m 21s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 18s, 500 more iterations: 3h 21m 33s.
[2025-11-13 09:56:27,284][__main__][INFO] - Starting iteration 307.
[2025-11-13 09:56:27,287][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:27,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:36,950][__main__][INFO] - Number of regex retries in iteration 307: 0
[2025-11-13 09:56:36,950][__main__][INFO] - agents played in iteration 307 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:56:37,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:37,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:37,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:37,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:37,489][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:37,489][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:38,165][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:38,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:39,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:39,766][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:40,091][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:40,742][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:41,068][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:41,396][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:41,721][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:42,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:42,373][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:42,703][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:43,031][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:43,358][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:44,010][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:44,335][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:44,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:45,315][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:45,641][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:45,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:46,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:46,618][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:47,269][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:47,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:48,249][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:48,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:49,303][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:56:50,009][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:56:50,011][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:56:50,012][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:56:50,887][__main__][INFO] - Iteration 308 took 23s (40.94% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 0s. Estimated total time: 19h 40m 2s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 40s. [2025-11-13 09:56:50,889][__main__][INFO] - Starting iteration 308. [2025-11-13 09:56:50,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 09:56:50,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:00,587][__main__][INFO] - Number of regex retries in iteration 308: 0
[2025-11-13 09:57:00,588][__main__][INFO] - agents played in iteration 308 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:57:01,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:01,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:01,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:01,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:01,134][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:01,134][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:02,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:02,451][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:02,780][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:03,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:03,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:04,084][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:04,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:04,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:05,067][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:05,393][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:05,720][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:06,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:06,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:06,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:07,678][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:08,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:08,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:09,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:09,636][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:11,264][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:11,590][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:12,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:12,979][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:13,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:13,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:13,680][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:14,568][__main__][INFO] - Iteration 309 took 23s (40.95% Gen, 55.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 28s. Estimated total time: 19h 43m 53s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 18s.
[2025-11-13 09:57:14,570][__main__][INFO] - Starting iteration 309.
[2025-11-13 09:57:14,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:14,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:24,039][__main__][INFO] - Number of regex retries in iteration 309: 0
[2025-11-13 09:57:24,040][__main__][INFO] - agents played in iteration 309 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:57:24,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:24,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:24,570][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:25,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:25,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:25,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:26,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:27,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:27,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:28,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:28,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:28,832][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:29,157][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:29,482][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:29,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:30,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:30,461][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:31,443][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:32,096][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:32,748][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:33,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:33,401][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:33,726][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:34,379][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:35,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:36,417][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:37,115][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:37,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:37,118][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:38,018][__main__][INFO] - Iteration 310 took 23s (40.37% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 29s. Estimated total time: 19h 32m 18s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 23s.
[2025-11-13 09:57:38,020][__main__][INFO] - Starting iteration 310.
[2025-11-13 09:57:38,022][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:38,023][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:47,948][__main__][INFO] - Number of regex retries in iteration 310: 0
[2025-11-13 09:57:47,949][__main__][INFO] - agents played in iteration 310 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:57:48,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:48,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:48,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:48,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:48,500][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:48,501][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:49,200][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:50,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:50,476][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:51,129][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:51,457][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:52,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:52,436][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:52,766][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:53,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:53,420][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:53,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:54,071][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:54,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:54,731][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:55,057][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:55,384][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:55,710][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:56,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:56,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:56,689][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:57,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:57,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:57,670][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:58,326][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:58,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:58,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:59,308][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:59,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:00,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:01,075][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:01,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:01,078][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:02,810][__main__][INFO] - Iteration 311 took 24s (40.04% Gen, 52.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 10s. Estimated total time: 20h 39m 24s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 18s, 500 more iterations: 3h 26m 34s.
[2025-11-13 09:58:02,812][__main__][INFO] - Starting iteration 311.
[2025-11-13 09:58:02,814][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:58:02,815][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:12,016][__main__][INFO] - Number of regex retries in iteration 311: 0
[2025-11-13 09:58:12,017][__main__][INFO] - agents played in iteration 311 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:58:12,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:12,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:12,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:12,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:12,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:12,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:13,274][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:13,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:13,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:14,223][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:14,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:14,873][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:15,198][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:15,525][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:16,502][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:17,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:18,133][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:18,785][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:20,089][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:20,741][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:21,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:21,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:21,719][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:22,045][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:22,371][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:22,697][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:23,023][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:23,349][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:23,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:24,397][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:25,099][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:25,100][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:25,102][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:25,974][__main__][INFO] - Iteration 312 took 23s (39.73% Gen, 56.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 26s. Estimated total time: 19h 18m 2s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 0s.
[2025-11-13 09:58:25,976][__main__][INFO] - Starting iteration 312.
[2025-11-13 09:58:25,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:58:25,980][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:35,524][__main__][INFO] - Number of regex retries in iteration 312: 0
[2025-11-13 09:58:35,524][__main__][INFO] - agents played in iteration 312 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:58:35,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:35,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:36,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:36,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:36,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:36,047][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:36,750][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:37,047][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:37,375][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:37,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:38,028][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:38,680][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:39,004][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:39,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:39,653][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:39,979][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:41,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:41,610][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:41,937][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:42,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:42,589][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:43,240][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:43,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:43,894][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:44,546][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:44,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:45,522][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:45,848][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:46,174][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:46,827][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:47,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:47,873][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:48,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:48,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:48,564][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:49,433][__main__][INFO] - Iteration 313 took 23s (40.69% Gen, 55.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 45s. Estimated total time: 19h 32m 45s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 27s.
[2025-11-13 09:58:49,435][__main__][INFO] - Starting iteration 313.
[2025-11-13 09:58:49,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:58:49,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:58,246][__main__][INFO] - Number of regex retries in iteration 313: 0
[2025-11-13 09:58:58,246][__main__][INFO] - agents played in iteration 313 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:58:58,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:58,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:58,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:58,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:58,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:58,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:00,422][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:00,749][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:01,076][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:01,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:01,733][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:02,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:02,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:02,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:03,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:03,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:04,019][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:04,672][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:04,999][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:05,650][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:05,976][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:06,302][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:06,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:07,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:08,260][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:09,239][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:09,565][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:09,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:10,634][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:11,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:11,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:11,331][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:12,214][__main__][INFO] - Iteration 314 took 22s (38.67% Gen, 57.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 3m 25s. Estimated total time: 18h 58m 48s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 48s.
[2025-11-13 09:59:12,216][__main__][INFO] - Starting iteration 314.
[2025-11-13 09:59:12,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:12,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:21,590][__main__][INFO] - Number of regex retries in iteration 314: 0
[2025-11-13 09:59:21,590][__main__][INFO] - agents played in iteration 314 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:59:22,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:22,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:22,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:22,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:22,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:22,126][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:22,814][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:23,111][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:23,765][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:24,093][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:24,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:25,726][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:26,052][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:26,376][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:26,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:28,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:28,334][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:28,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:28,987][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:29,312][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:29,638][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:29,965][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:30,291][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:30,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:30,941][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:31,593][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:31,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:32,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:32,894][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:33,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:33,955][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:34,650][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:34,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:34,653][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:35,586][__main__][INFO] - Iteration 315 took 23s (40.10% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 35s. Estimated total time: 19h 28m 22s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 43s.
[2025-11-13 09:59:35,588][__main__][INFO] - Starting iteration 315.
[2025-11-13 09:59:35,590][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:35,591][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:44,718][__main__][INFO] - Number of regex retries in iteration 315: 0
[2025-11-13 09:59:44,719][__main__][INFO] - agents played in iteration 315 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 09:59:45,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:45,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:45,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:45,967][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:46,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:46,587][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:46,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:47,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:47,569][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:47,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:48,563][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:49,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:49,549][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:49,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:50,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:50,854][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:51,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:51,507][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:51,832][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:52,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:52,486][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:53,139][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:53,464][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:53,789][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:54,114][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:54,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:54,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:55,091][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:55,743][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:56,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:57,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:57,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:57,847][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:57,848][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:58,835][__main__][INFO] - Iteration 316 took 23s (39.27% Gen, 56.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 26m 7s. Estimated total time: 19h 22m 17s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 42s.
[2025-11-13 09:59:58,837][__main__][INFO] - Starting iteration 316.
[2025-11-13 09:59:58,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:58,841][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:08,184][__main__][INFO] - Number of regex retries in iteration 316: 0
[2025-11-13 10:00:08,185][__main__][INFO] - agents played in iteration 316 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:00:08,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:08,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:08,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:08,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:08,724][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:08,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:09,742][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:10,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:10,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:10,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:11,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:11,694][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:12,024][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:12,352][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:12,680][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:13,006][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:13,336][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:13,663][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:13,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:14,640][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:14,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:15,292][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:15,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:00:16,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:16,594][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:16,919][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:17,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:17,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:17,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:18,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:18,870][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:19,195][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:19,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:20,586][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:21,309][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:21,311][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:21,313][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:22,291][__main__][INFO] - Iteration 317 took 23s (39.84% Gen, 55.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 2s. Estimated total time: 19h 32m 36s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 26s.
[2025-11-13 10:00:22,293][__main__][INFO] - Starting iteration 317.
[2025-11-13 10:00:22,297][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:22,298][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:31,432][__main__][INFO] - Number of regex retries in iteration 317: 0
[2025-11-13 10:00:31,432][__main__][INFO] - agents played in iteration 317 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:00:31,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:31,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:31,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:31,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:31,976][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:31,976][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:32,704][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:33,000][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:33,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:33,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:34,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:34,956][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:35,282][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:35,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:36,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:37,232][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:37,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:37,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:38,211][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:38,862][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:39,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:00:39,513][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:40,163][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:40,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:40,813][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:41,138][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:41,464][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:42,114][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:42,439][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:42,765][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:43,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:43,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:44,527][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:44,528][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:44,530][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:45,512][__main__][INFO] - Iteration 318 took 23s (39.34% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 50s. Estimated total time: 19h 20m 47s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 27s.
[2025-11-13 10:00:45,514][__main__][INFO] - Starting iteration 318.
[2025-11-13 10:00:45,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:45,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:54,838][__main__][INFO] - Number of regex retries in iteration 318: 0
[2025-11-13 10:00:54,839][__main__][INFO] - agents played in iteration 318 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:00:55,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:55,376][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:55,377][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:56,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:56,393][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:56,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:57,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:57,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:58,350][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:58,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:59,000][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:59,327][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:59,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:00,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:00,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:00,956][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:01,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:01,932][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:02,257][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:02,582][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:02,907][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:03,560][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:03,886][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:04,212][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:04,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:04,863][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:05,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:05,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:06,166][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:06,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:07,218][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:07,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:07,965][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:07,967][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:08,929][__main__][INFO] - Iteration 319 took 23s (39.81% Gen, 56.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 18s. Estimated total time: 19h 30m 38s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 6s.
[2025-11-13 10:01:08,931][__main__][INFO] - Starting iteration 319.
[2025-11-13 10:01:08,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:08,935][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:18,706][__main__][INFO] - Number of regex retries in iteration 319: 0
[2025-11-13 10:01:18,706][__main__][INFO] - agents played in iteration 319 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:01:19,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:19,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:19,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:19,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:19,254][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:19,254][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:19,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:20,267][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:20,597][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:20,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:22,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:22,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:23,212][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:23,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:23,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:24,188][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:24,515][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:24,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:25,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:25,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:26,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:27,776][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:28,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:28,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:28,751][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:29,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:29,730][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:30,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:30,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:31,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:31,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:31,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:31,840][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:32,774][__main__][INFO] - Iteration 320 took 23s (40.99% Gen, 55.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 18s. Estimated total time: 19h 52m 2s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 44s, 500 more iterations: 3h 18m 40s.
[2025-11-13 10:01:32,776][__main__][INFO] - Starting iteration 320.
[2025-11-13 10:01:32,779][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:32,779][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:41,821][__main__][INFO] - Number of regex retries in iteration 320: 0
[2025-11-13 10:01:41,822][__main__][INFO] - agents played in iteration 320 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:01:42,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:42,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:42,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:43,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:43,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:43,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:44,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:44,370][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:44,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:46,003][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:46,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:47,958][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:48,283][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:48,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:49,261][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:49,587][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:49,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:50,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:50,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:51,540][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:51,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:52,192][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:52,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:53,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:54,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:54,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:54,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:54,963][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:56,748][__main__][INFO] - Iteration 321 took 23s (37.72% Gen, 54.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 0m 24s. Estimated total time: 19h 58m 31s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 45s.
[2025-11-13 10:01:56,750][__main__][INFO] - Starting iteration 321.
[2025-11-13 10:01:56,753][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:01:56,753][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:06,479][__main__][INFO] - Number of regex retries in iteration 321: 0
[2025-11-13 10:02:06,479][__main__][INFO] - agents played in iteration 321 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:02:06,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:06,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:06,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:07,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:07,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:07,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:07,729][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:08,024][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:08,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:08,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:09,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:09,656][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:09,981][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:10,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:10,633][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:11,284][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:11,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:11,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:12,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:12,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:13,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:13,885][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:14,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:16,163][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:17,140][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:17,470][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:18,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:18,859][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:19,599][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:19,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:19,602][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:20,559][__main__][INFO] - Iteration 322 took 23s (40.85% Gen, 55.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 51m 50s. Estimated total time: 19h 50m 22s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 23s.
[2025-11-13 10:02:20,562][__main__][INFO] - Starting iteration 322.
[2025-11-13 10:02:20,564][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:02:20,565][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:29,728][__main__][INFO] - Number of regex retries in iteration 322: 0
[2025-11-13 10:02:29,729][__main__][INFO] - agents played in iteration 322 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:02:30,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:30,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:30,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:31,007][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:31,629][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:31,959][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:32,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:32,936][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:33,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:33,591][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:33,916][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:34,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:34,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:34,897][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:35,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:35,875][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:36,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:37,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:37,503][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:38,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:38,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:38,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:39,132][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:39,457][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:39,782][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:40,107][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:40,434][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:40,761][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:41,085][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:41,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:42,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:42,867][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:42,869][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:42,871][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:43,828][__main__][INFO] - Iteration 323 took 23s (39.39% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 20s. Estimated total time: 19h 23m 14s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 52s.
[2025-11-13 10:02:43,831][__main__][INFO] - Starting iteration 323.
[2025-11-13 10:02:43,834][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:02:43,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:53,334][__main__][INFO] - Number of regex retries in iteration 323: 0
[2025-11-13 10:02:53,334][__main__][INFO] - agents played in iteration 323 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:02:53,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:53,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:53,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:53,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:53,879][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:53,879][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:54,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:54,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:56,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:57,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:57,835][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:58,161][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:58,487][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:59,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:59,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:59,791][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:00,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:00,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:01,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:01,744][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:02,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:02,394][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:02,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:03,369][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:03,695][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:04,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:04,349][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:05,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:05,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:06,454][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:06,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:06,457][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:07,430][__main__][INFO] - Iteration 324 took 23s (40.25% Gen, 55.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 32s. Estimated total time: 19h 39m 50s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 38s.
[2025-11-13 10:03:07,432][__main__][INFO] - Starting iteration 324.
[2025-11-13 10:03:07,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:07,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:03:16,903][__main__][INFO] - Number of regex retries in iteration 324: 0 [2025-11-13 10:03:16,904][__main__][INFO] - agents played in iteration 324 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:03:17,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:17,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:17,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:17,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:17,453][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:03:17,453][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:03:18,192][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:18,491][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:18,816][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:19,141][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:19,792][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:20,768][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:21,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:21,748][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:22,073][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:22,399][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:22,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:23,048][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:23,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:24,022][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:24,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:25,650][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:25,975][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:26,952][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:27,603][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:27,928][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:28,253][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:28,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:29,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:30,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:30,062][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:30,064][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:31,029][__main__][INFO] - Iteration 325 took 23s (40.13% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 2s. Estimated total time: 19h 39m 43s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 37s.
[2025-11-13 10:03:31,031][__main__][INFO] - Starting iteration 325.
[2025-11-13 10:03:31,035][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:31,035][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:03:40,506][__main__][INFO] - Number of regex retries in iteration 325: 0
[2025-11-13 10:03:40,507][__main__][INFO] - agents played in iteration 325 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:03:40,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:40,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:41,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:41,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:41,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:03:41,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:03:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:42,085][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:42,410][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:42,736][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:43,064][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:43,391][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:43,716][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:44,042][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:45,019][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:45,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:45,670][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:46,648][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:46,972][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:47,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:47,624][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:47,951][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:48,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:48,603][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:48,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:49,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:49,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:50,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:50,885][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:51,869][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:52,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:52,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:53,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:53,633][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:53,634][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:54,609][__main__][INFO] - Iteration 326 took 23s (40.17% Gen, 55.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 40s. Estimated total time: 19h 38m 45s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 27s.
[2025-11-13 10:03:54,611][__main__][INFO] - Starting iteration 326.
[2025-11-13 10:03:54,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:54,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:03,628][__main__][INFO] - Number of regex retries in iteration 326: 0
[2025-11-13 10:04:03,629][__main__][INFO] - agents played in iteration 326 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:04:04,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:04,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:04,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:04,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:04,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:04,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:04,934][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:05,231][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:05,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:05,887][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:06,864][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:07,190][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:07,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:07,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:08,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:08,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:08,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:09,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:09,469][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:09,794][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:10,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:10,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:11,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:04:11,747][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:04:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:04:12,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:04:12,725][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:04:13,052][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:04:13,378][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:04:13,703][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:04:14,031][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:04:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:04:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:04:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:04:15,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:04:16,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:04:16,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:04:16,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:04:16,754][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:04:17,734][__main__][INFO] - Iteration 327 took 23s (38.98% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 15m 32s. Estimated total time: 19h 16m 0s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 40s.
[2025-11-13 10:04:17,736][__main__][INFO] - Starting iteration 327.
[2025-11-13 10:04:17,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:04:17,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:27,146][__main__][INFO] - Number of regex retries in iteration 327: 0
[2025-11-13 10:04:27,147][__main__][INFO] - agents played in iteration 327 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:04:27,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:27,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:27,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:27,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:27,701][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:27,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:28,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:28,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:29,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:29,392][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:29,717][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:30,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:30,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:31,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:31,342][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:31,672][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:31,996][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:32,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:33,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:33,623][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:33,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:04:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:04:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:04:35,905][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:04:36,232][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:04:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:04:36,884][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:04:37,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:04:37,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:04:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:04:38,187][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:04:38,517][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:04:38,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:04:39,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:04:40,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:04:40,258][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:04:40,259][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:04:41,215][__main__][INFO] - Iteration 328 took 23s (40.07% Gen, 55.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 56s. Estimated total time: 19h 33m 48s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 38s.
[2025-11-13 10:04:41,217][__main__][INFO] - Starting iteration 328.
[2025-11-13 10:04:41,220][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:04:41,221][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:49,088][mllm.models.large_language_model_local][WARNING] - Response |), retry 1/1
[2025-11-13 10:04:50,739][__main__][INFO] - Number of regex retries in iteration 328: 1
[2025-11-13 10:04:50,739][__main__][INFO] - agents played in iteration 328 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:04:51,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:51,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:51,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:51,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:51,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:51,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:52,013][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:52,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:52,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:53,289][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:54,591][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:54,917][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:55,245][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:55,569][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:55,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:56,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:56,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:58,498][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:04:58,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:04:59,150][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:04:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:04:59,799][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:00,124][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:00,773][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:01,427][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:02,078][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:02,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:03,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:05:03,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:05:03,805][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:05:03,807][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:05:04,781][__main__][INFO] - Iteration 329 took 23s (40.40% Gen, 55.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 48s. Estimated total time: 19h 38m 3s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 20s.
[2025-11-13 10:05:04,783][__main__][INFO] - Starting iteration 329.
[2025-11-13 10:05:04,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:05:04,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:05:14,018][__main__][INFO] - Number of regex retries in iteration 329: 0
[2025-11-13 10:05:14,019][__main__][INFO] - agents played in iteration 329 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:05:14,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:14,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:14,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:14,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:14,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:05:14,564][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:05:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:05:15,597][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:05:15,924][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:05:16,254][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:05:16,580][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:05:16,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:05:17,231][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:05:17,556][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:05:17,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:05:18,209][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:05:18,534][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:05:18,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:05:19,184][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:05:19,510][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:05:19,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:05:20,161][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:05:20,487][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:05:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:05:21,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:05:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:05:21,790][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:22,115][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:22,444][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:22,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:23,096][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:23,420][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:23,745][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:24,397][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:24,721][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:25,045][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:25,370][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:25,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:26,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:05:27,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:27,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:27,094][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:28,133][__main__][INFO] - Iteration 330 took 23s (39.54% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 25m 43s. Estimated total time: 19h 27m 22s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 33s. [2025-11-13 10:05:28,135][__main__][INFO] - Starting iteration 330. [2025-11-13 10:05:28,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 10:05:28,139][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:05:37,633][__main__][INFO] - Number of regex retries in iteration 330: 0 [2025-11-13 10:05:37,634][__main__][INFO] - agents played in iteration 330 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:05:38,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:38,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:38,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:38,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:38,182][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:05:38,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:05:38,928][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:05:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:05:39,553][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:05:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:05:40,206][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:05:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:05:40,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:05:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:05:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:05:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:05:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:05:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:05:42,815][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:05:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:05:43,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:05:43,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:05:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:05:44,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:05:44,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:05:45,096][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:05:45,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:05:45,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:05:46,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:05:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:05:46,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:05:47,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:05:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:05:47,711][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:05:48,036][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:05:48,360][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:05:48,691][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:05:49,016][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:05:49,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:05:50,064][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:05:50,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:50,798][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:50,800][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:52,625][__main__][INFO] - Iteration 331 took 24s (38.77% Gen, 53.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 21s. Estimated total time: 20h 24m 25s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 48s, 500 more iterations: 3h 24m 4s. [2025-11-13 10:05:52,627][__main__][INFO] - Starting iteration 331. [2025-11-13 10:05:52,630][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 10:05:52,631][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:06:01,437][__main__][INFO] - Number of regex retries in iteration 331: 0 [2025-11-13 10:06:01,438][__main__][INFO] - agents played in iteration 331 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:06:01,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:01,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:01,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:02,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:02,335][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:06:02,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:06:03,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:06:03,364][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:06:03,690][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:06:04,016][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:06:04,343][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:06:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:06:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:06:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:06:05,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:06:05,970][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:06:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:06:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:06:06,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:06:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:06:07,600][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:06:07,927][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:06:08,255][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:06:08,580][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:06:08,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:06:09,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:06:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:06:09,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:06:10,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:06:10,545][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:06:10,872][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:06:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:06:11,521][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:06:11,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:06:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:06:12,496][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:06:12,822][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:06:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:06:13,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:06:14,199][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:06:14,923][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:06:14,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:06:14,926][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:06:15,877][__main__][INFO] - Iteration 332 took 23s (37.88% Gen, 58.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 19m 58s. Estimated total time: 19h 22m 25s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 44s. [2025-11-13 10:06:15,880][__main__][INFO] - Starting iteration 332. [2025-11-13 10:06:15,883][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 10:06:15,883][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:06:25,058][__main__][INFO] - Number of regex retries in iteration 332: 0 [2025-11-13 10:06:25,058][__main__][INFO] - agents played in iteration 332 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:06:25,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:25,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:25,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:25,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:25,599][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:06:25,600][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:06:26,698][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:06:26,997][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:06:27,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:06:27,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:06:27,975][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:06:28,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:06:28,626][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:06:28,957][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:06:29,283][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:06:29,610][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:06:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:06:30,262][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:06:30,589][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:06:30,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:06:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:06:31,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:06:31,889][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:06:32,215][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:06:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:06:32,868][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:06:33,194][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:06:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:06:33,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:06:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:06:34,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:06:34,831][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:06:35,161][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:06:35,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:06:35,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:06:36,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:06:36,463][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:06:36,790][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:06:37,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:06:37,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:06:38,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:06:38,531][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:06:38,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:06:39,493][__main__][INFO] - Iteration 333 took 23s (38.86% Gen, 57.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 43s. Estimated total time: 19h 40m 33s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 45s. [2025-11-13 10:06:39,495][__main__][INFO] - Starting iteration 333. [2025-11-13 10:06:39,499][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 10:06:39,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:06:48,948][__main__][INFO] - Number of regex retries in iteration 333: 0 [2025-11-13 10:06:48,949][__main__][INFO] - agents played in iteration 333 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:06:49,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:49,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:49,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:49,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:49,491][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:06:49,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:06:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:06:50,535][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:06:50,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:06:51,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:06:51,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:06:51,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:06:52,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:06:52,491][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:06:52,819][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:06:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:06:53,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:06:53,797][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:06:54,123][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:06:54,449][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:06:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:06:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:06:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:06:55,755][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:06:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:06:56,407][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:06:56,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:06:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:06:57,389][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:06:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:06:58,045][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:06:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:06:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:06:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:06:59,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:06:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:07:00,010][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:07:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:07:00,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:07:01,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:07:02,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:07:02,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:07:02,122][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:07:03,144][__main__][INFO] - Iteration 334 took 23s (39.96% Gen, 55.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 39m 2s. Estimated total time: 19h 42m 16s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s. [2025-11-13 10:07:03,146][__main__][INFO] - Starting iteration 334. [2025-11-13 10:07:03,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 10:07:03,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:07:12,580][__main__][INFO] - Number of regex retries in iteration 334: 0 [2025-11-13 10:07:12,580][__main__][INFO] - agents played in iteration 334 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:07:13,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:13,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:13,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:13,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:13,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:07:13,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:07:13,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:07:14,181][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:07:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:07:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:07:15,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:07:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:07:15,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:07:16,142][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:07:16,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:07:16,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:07:17,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:07:17,446][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:07:17,772][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:07:18,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:07:18,424][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:07:18,749][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:07:19,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:07:19,403][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:07:19,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:07:20,055][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:07:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:07:20,707][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:07:21,032][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:07:21,358][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:07:21,683][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:07:22,009][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:07:22,334][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:07:22,660][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:07:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:07:23,312][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:07:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:07:23,960][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:07:24,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:07:25,009][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:07:25,742][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:07:25,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:07:25,745][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:07:26,707][__main__][INFO] - Iteration 335 took 23s (40.03% Gen, 55.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 16s. Estimated total time: 19h 37m 54s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 19s. [2025-11-13 10:07:26,709][__main__][INFO] - Starting iteration 335. [2025-11-13 10:07:26,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 10:07:26,713][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:07:36,366][__main__][INFO] - Number of regex retries in iteration 335: 0 [2025-11-13 10:07:36,367][__main__][INFO] - agents played in iteration 335 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:07:36,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:36,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:36,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:36,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:36,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:07:36,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:07:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:37,946][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:38,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:38,597][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:38,923][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:39,249][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:39,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:40,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:40,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:41,201][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:41,858][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:42,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:42,513][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:42,839][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:43,165][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:43,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:44,141][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:44,467][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:44,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:45,119][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:45,443][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:45,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:46,096][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:46,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:47,404][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:48,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:48,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:49,528][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:49,529][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:49,531][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:50,542][__main__][INFO] - Iteration 336 took 23s (40.51% Gen, 55.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 32s. Estimated total time: 19h 51m 33s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 35s.
[2025-11-13 10:07:50,545][__main__][INFO] - Starting iteration 336.
[2025-11-13 10:07:50,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:50,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:59,547][__main__][INFO] - Number of regex retries in iteration 336: 0
[2025-11-13 10:07:59,548][__main__][INFO] - agents played in iteration 336 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:08:00,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:00,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:00,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:00,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:00,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:00,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:00,880][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:01,178][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:01,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:02,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:02,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:02,809][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:03,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:03,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:04,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:04,762][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:05,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:05,415][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:05,740][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:06,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:06,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:07,046][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:07,369][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:07,694][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:08,019][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:08,991][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:09,316][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:09,974][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:10,625][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:10,951][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:11,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:12,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:12,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:12,737][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:12,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:13,709][__main__][INFO] - Iteration 337 took 23s (38.85% Gen, 56.95% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 13m 40s. Estimated total time: 19h 18m 4s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 0s.
[2025-11-13 10:08:13,712][__main__][INFO] - Starting iteration 337.
[2025-11-13 10:08:13,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:13,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:22,736][__main__][INFO] - Number of regex retries in iteration 337: 0
[2025-11-13 10:08:22,737][__main__][INFO] - agents played in iteration 337 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:08:23,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:23,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:23,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:24,051][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:24,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:24,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:25,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:25,652][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:25,977][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:26,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:26,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:26,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:27,935][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:28,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:28,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:29,235][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:29,561][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:29,886][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:30,212][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:30,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:31,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:31,522][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:31,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:32,506][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:32,836][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:33,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:33,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:34,145][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:34,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:35,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:35,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:35,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:35,922][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:36,902][__main__][INFO] - Iteration 338 took 23s (38.90% Gen, 56.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 35s. Estimated total time: 19h 19m 23s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 13s.
[2025-11-13 10:08:36,904][__main__][INFO] - Starting iteration 338.
[2025-11-13 10:08:36,908][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:36,908][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:45,894][__main__][INFO] - Number of regex retries in iteration 338: 0
[2025-11-13 10:08:45,895][__main__][INFO] - agents played in iteration 338 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:08:46,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:46,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:46,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:46,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:46,447][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:46,448][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:47,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:47,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:47,832][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:48,158][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:48,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:49,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:49,788][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:50,113][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:50,438][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:50,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:51,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:51,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:51,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:52,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:52,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:53,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:53,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:54,353][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:54,679][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:55,328][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:55,654][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:55,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:56,309][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:56,634][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:56,958][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:57,286][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:57,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:58,287][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:59,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:59,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:59,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:59,995][__main__][INFO] - Iteration 339 took 23s (38.92% Gen, 56.83% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 9m 12s. Estimated total time: 19h 14m 23s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 23s.
[2025-11-13 10:08:59,997][__main__][INFO] - Starting iteration 339.
[2025-11-13 10:09:00,001][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:00,002][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:09,297][__main__][INFO] - Number of regex retries in iteration 339: 0
[2025-11-13 10:09:09,297][__main__][INFO] - agents played in iteration 339 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:09:09,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:09,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:09,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:09,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:09,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:09,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:10,590][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:10,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:11,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:11,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:12,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:12,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:13,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:14,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:15,123][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:15,448][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:15,774][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:16,100][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:16,425][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:16,750][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:17,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:17,398][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:17,724][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:19,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:19,681][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:20,007][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:20,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:20,659][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:20,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:21,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:22,395][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:22,397][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:22,398][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:23,371][__main__][INFO] - Iteration 340 took 23s (39.77% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 58s. Estimated total time: 19h 28m 32s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 45s.
[2025-11-13 10:09:23,373][__main__][INFO] - Starting iteration 340.
[2025-11-13 10:09:23,376][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:23,377][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:32,919][__main__][INFO] - Number of regex retries in iteration 340: 0
[2025-11-13 10:09:32,919][__main__][INFO] - agents played in iteration 340 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:09:33,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:33,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:33,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:33,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:33,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:33,506][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:34,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:34,878][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:35,528][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:36,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:37,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:37,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:37,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:38,140][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:38,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:38,791][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:39,117][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:39,443][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:39,773][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:40,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:40,422][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:40,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:41,075][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:41,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:42,050][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:42,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:42,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:43,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:43,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:43,686][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:44,014][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:44,669][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:45,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:46,074][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:46,075][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:46,077][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:48,059][__main__][INFO] - Iteration 341 took 24s (38.66% Gen, 53.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 11s. Estimated total time: 20h 34m 10s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 8s, 500 more iterations: 3h 25m 41s.
[2025-11-13 10:09:48,061][__main__][INFO] - Starting iteration 341.
[2025-11-13 10:09:48,064][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:09:48,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:57,980][__main__][INFO] - Number of regex retries in iteration 341: 0
[2025-11-13 10:09:57,981][__main__][INFO] - agents played in iteration 341 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:09:58,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:58,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:58,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:58,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:58,541][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:58,541][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:59,590][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:00,240][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:00,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:01,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:01,541][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:01,865][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:02,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:02,850][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:04,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:04,486][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:04,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:05,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:06,117][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:06,773][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:07,098][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:07,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:07,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:08,076][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:08,400][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:09,375][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:09,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:10,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:11,138][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:11,139][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:11,141][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:12,096][__main__][INFO] - Iteration 342 took 24s (41.26% Gen, 54.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 18s. Estimated total time: 20h 1m 41s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 3s, 500 more iterations: 3h 20m 16s.
[2025-11-13 10:10:12,099][__main__][INFO] - Starting iteration 342.
[2025-11-13 10:10:12,102][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:12,102][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:21,127][__main__][INFO] - Number of regex retries in iteration 342: 0
[2025-11-13 10:10:21,128][__main__][INFO] - agents played in iteration 342 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:10:21,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:21,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:21,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:22,421][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:23,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:24,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:24,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:25,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:25,974][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:26,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:27,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:27,610][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:28,265][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:28,589][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:29,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:29,889][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:30,539][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:30,864][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:31,193][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:31,518][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:32,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:32,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:32,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:33,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:34,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:34,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:34,249][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:35,216][__main__][INFO] - Iteration 343 took 23s (39.04% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 8m 57s. Estimated total time: 19h 15m 43s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 37s.
[2025-11-13 10:10:35,218][__main__][INFO] - Starting iteration 343.
[2025-11-13 10:10:35,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:35,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:45,231][__main__][INFO] - Number of regex retries in iteration 343: 0
[2025-11-13 10:10:45,231][__main__][INFO] - agents played in iteration 343 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:10:45,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:45,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:45,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:46,527][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:46,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:47,151][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:47,475][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:48,126][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:48,455][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:48,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:49,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:49,755][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:50,407][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:51,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:51,719][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:52,049][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:52,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:52,710][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:53,041][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:54,020][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:54,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:54,993][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:56,292][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:56,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:57,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:58,377][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:58,379][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:58,380][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:59,323][__main__][INFO] - Iteration 344 took 24s (41.24% Gen, 54.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 58s. Estimated total time: 20h 5m 8s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 10s, 500 more iterations: 3h 20m 51s.
[2025-11-13 10:10:59,325][__main__][INFO] - Starting iteration 344.
[2025-11-13 10:10:59,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:59,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:09,084][__main__][INFO] - Number of regex retries in iteration 344: 0
[2025-11-13 10:11:09,085][__main__][INFO] - agents played in iteration 344 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:11:09,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:09,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:09,635][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:10,395][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:11,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:11,343][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:11,672][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:11,997][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:12,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:12,976][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:13,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:13,956][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:14,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:14,610][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:14,937][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:15,264][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:15,592][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:15,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:16,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:16,572][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:16,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:17,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:17,871][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:18,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:18,854][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:19,833][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:20,158][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:20,487][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:20,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:21,530][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:22,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:22,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:22,251][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:23,271][__main__][INFO] - Iteration 345 took 23s (40.75% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 38s. Estimated total time: 19h 57m 12s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 32s.
[2025-11-13 10:11:23,274][__main__][INFO] - Starting iteration 345.
[2025-11-13 10:11:23,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:23,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:32,057][__main__][INFO] - Number of regex retries in iteration 345: 0
[2025-11-13 10:11:32,057][__main__][INFO] - agents played in iteration 345 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:11:32,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:32,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:32,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:32,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:32,604][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:32,604][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:33,658][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:33,985][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:34,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:34,960][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:35,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:35,610][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:35,934][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:36,911][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:37,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:37,561][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:37,887][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:38,214][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:38,540][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:38,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:39,191][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:39,516][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:39,841][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:40,166][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:40,491][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:40,816][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:41,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:41,465][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:41,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:42,113][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:42,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:42,762][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:43,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:44,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:45,121][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:45,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:45,124][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:46,106][__main__][INFO] - Iteration 346 took 22s (38.46% Gen, 57.23% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 53m 34s. Estimated total time: 19h 1m 31s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 15s.
[2025-11-13 10:11:46,108][__main__][INFO] - Starting iteration 346.
[2025-11-13 10:11:46,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:46,111][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:55,836][__main__][INFO] - Number of regex retries in iteration 346: 0
[2025-11-13 10:11:55,837][__main__][INFO] - agents played in iteration 346 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:11:56,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:56,384][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:56,384][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:57,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:57,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:59,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:59,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:00,699][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:02,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:02,334][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:02,662][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:02,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:03,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:03,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:04,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:04,960][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:05,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:05,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:05,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:06,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:07,243][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:07,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:08,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:09,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:09,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:09,009][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:09,978][__main__][INFO] - Iteration 347 took 23s (40.75% Gen, 55.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 3s. Estimated total time: 19h 53m 24s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 54s.
[2025-11-13 10:12:09,980][__main__][INFO] - Starting iteration 347.
[2025-11-13 10:12:09,984][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:09,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:12:19,528][__main__][INFO] - Number of regex retries in iteration 347: 0
[2025-11-13 10:12:19,528][__main__][INFO] - agents played in iteration 347 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:12:19,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:20,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:20,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:20,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:20,074][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:20,074][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:12:20,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:21,447][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:21,772][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:22,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:22,427][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:22,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:23,078][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:23,729][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:24,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:24,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:24,715][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:25,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:25,368][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:25,694][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:26,019][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:26,349][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:26,675][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:27,002][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:27,976][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:28,300][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:28,625][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:29,274][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:29,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:29,927][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:30,252][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:30,579][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:30,906][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:31,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:31,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:32,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:32,640][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:32,642][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:33,620][__main__][INFO] - Iteration 348 took 23s (40.38% Gen, 55.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 6s. Estimated total time: 19h 41m 50s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 58s.
[2025-11-13 10:12:33,622][__main__][INFO] - Starting iteration 348.
[2025-11-13 10:12:33,625][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:33,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:12:42,374][__main__][INFO] - Number of regex retries in iteration 348: 0
[2025-11-13 10:12:42,374][__main__][INFO] - agents played in iteration 348 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:12:42,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:42,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:42,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:42,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:42,918][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:42,919][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:12:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:43,953][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:44,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:44,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:44,934][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:45,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:45,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:45,912][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:46,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:46,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:46,894][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:47,220][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:47,546][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:48,195][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:48,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:48,846][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:49,172][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:49,823][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:50,148][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:51,128][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:51,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:51,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:53,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:53,402][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:54,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:54,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:55,446][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:55,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:55,449][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:56,408][__main__][INFO] - Iteration 349 took 22s (38.40% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 50m 6s. Estimated total time: 18h 59m 14s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 52s.
[2025-11-13 10:12:56,410][__main__][INFO] - Starting iteration 349.
[2025-11-13 10:12:56,414][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:56,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:05,586][__main__][INFO] - Number of regex retries in iteration 349: 0
[2025-11-13 10:13:05,587][__main__][INFO] - agents played in iteration 349 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:13:06,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:06,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:06,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:06,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:06,133][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:06,134][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:06,865][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:07,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:07,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:08,138][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:08,465][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:08,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:09,115][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:09,441][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:09,772][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:10,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:11,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:11,401][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:11,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:12,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:13,031][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:14,675][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:15,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:15,329][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:15,654][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:15,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:16,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:16,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:17,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:17,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:18,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:18,672][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:18,673][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:19,803][__main__][INFO] - Iteration 350 took 23s (39.21% Gen, 55.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 59s. Estimated total time: 19h 29m 29s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 54s.
[2025-11-13 10:13:19,805][__main__][INFO] - Starting iteration 350.
[2025-11-13 10:13:19,808][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:13:19,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:29,053][__main__][INFO] - Number of regex retries in iteration 350: 0
[2025-11-13 10:13:29,054][__main__][INFO] - agents played in iteration 350 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:13:29,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:29,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:29,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:29,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:29,605][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:29,606][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:30,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:31,015][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:31,344][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:31,670][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:32,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:32,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:33,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:33,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:35,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:35,578][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:35,904][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:36,229][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:36,555][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:36,882][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:37,209][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:37,535][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:38,187][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:38,511][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:38,835][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:39,160][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:40,135][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:40,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:41,491][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:42,212][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:42,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:42,215][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:44,151][__main__][INFO] - Iteration 351 took 24s (37.98% Gen, 54.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 7m 15s. Estimated total time: 20h 17m 10s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 34s, 500 more iterations: 3h 22m 51s.
[2025-11-13 10:13:44,153][__main__][INFO] - Starting iteration 351.
[2025-11-13 10:13:44,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:13:44,157][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:53,932][__main__][INFO] - Number of regex retries in iteration 351: 0
[2025-11-13 10:13:53,933][__main__][INFO] - agents played in iteration 351 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:13:54,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:54,488][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:54,488][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:55,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:55,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:56,175][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:56,502][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:56,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:57,483][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:58,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:59,115][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:00,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:00,750][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:01,077][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:02,380][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:03,028][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:04,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:04,336][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:04,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:04,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:05,316][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:05,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:06,376][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:07,115][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:07,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:07,118][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:08,072][__main__][INFO] - Iteration 352 took 23s (40.87% Gen, 55.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 29s. Estimated total time: 19h 55m 48s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 18s.
[2025-11-13 10:14:08,074][__main__][INFO] - Starting iteration 352.
[2025-11-13 10:14:08,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:08,078][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:17,609][__main__][INFO] - Number of regex retries in iteration 352: 0
[2025-11-13 10:14:17,609][__main__][INFO] - agents played in iteration 352 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:14:18,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:18,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:18,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:18,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:18,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:18,151][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:18,896][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:19,193][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:19,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:19,845][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:20,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:20,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:21,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:21,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:21,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:22,137][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:22,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:22,793][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:23,121][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:23,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:24,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:24,428][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:24,760][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:25,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:25,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:26,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:26,724][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:27,054][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:27,378][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:27,704][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:28,029][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:28,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:28,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:29,004][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:29,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:30,010][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:30,726][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:30,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:30,730][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:31,710][__main__][INFO] - Iteration 353 took 23s (40.33% Gen, 55.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 57s. Estimated total time: 19h 41m 39s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 56s.
[2025-11-13 10:14:31,712][__main__][INFO] - Starting iteration 353.
[2025-11-13 10:14:31,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:31,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:41,006][__main__][INFO] - Number of regex retries in iteration 353: 0
[2025-11-13 10:14:41,007][__main__][INFO] - agents played in iteration 353 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:14:41,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:41,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:41,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:41,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:41,549][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:41,550][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:42,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:42,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:43,585][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:43,911][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:44,561][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:45,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:45,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:45,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:46,184][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:46,509][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:46,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:47,161][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:47,485][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:47,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:48,136][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:48,465][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:49,121][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:49,448][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:49,775][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:50,425][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:50,753][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:51,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:51,403][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:51,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:52,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:52,377][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:52,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:53,382][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:54,102][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:54,104][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:54,105][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:55,073][__main__][INFO] - Iteration 354 took 23s (39.78% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 48s. Estimated total time: 19h 27m 54s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 39s.
[2025-11-13 10:14:55,075][__main__][INFO] - Starting iteration 354.
[2025-11-13 10:14:55,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:55,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:04,280][__main__][INFO] - Number of regex retries in iteration 354: 0
[2025-11-13 10:15:04,280][__main__][INFO] - agents played in iteration 354 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:15:04,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:04,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:04,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:04,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:04,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:04,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:05,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:05,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:06,523][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:06,848][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:07,174][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:07,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:07,826][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:08,153][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:08,481][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:09,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:09,460][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:10,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:10,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:10,768][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:11,093][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:11,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:11,747][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:12,073][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:12,403][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:12,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:13,060][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:13,384][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:14,037][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:14,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:14,693][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:15,351][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:15,680][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:16,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:16,699][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:17,439][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:17,440][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:17,442][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:18,415][__main__][INFO] - Iteration 355 took 23s (39.42% Gen, 56.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 15m 24s. Estimated total time: 19h 26m 54s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 29s.
[2025-11-13 10:15:18,417][__main__][INFO] - Starting iteration 355.
[2025-11-13 10:15:18,421][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:18,422][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:27,973][__main__][INFO] - Number of regex retries in iteration 355: 0
[2025-11-13 10:15:27,974][__main__][INFO] - agents played in iteration 355 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:15:28,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:28,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:28,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:28,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:28,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:28,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:29,277][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:29,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:29,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:30,551][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:30,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:31,202][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:31,527][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:31,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:32,502][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:32,833][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:33,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:33,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:34,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:34,456][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:34,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:35,434][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:35,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:36,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:37,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:37,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:37,713][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:38,039][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:38,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:39,012][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:39,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:40,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:41,053][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:41,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:41,057][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:42,027][__main__][INFO] - Iteration 356 took 23s (40.46% Gen, 55.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 26s. Estimated total time: 19h 40m 19s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 43s.
[2025-11-13 10:15:42,029][__main__][INFO] - Starting iteration 356.
[2025-11-13 10:15:42,032][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:42,033][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:51,256][__main__][INFO] - Number of regex retries in iteration 356: 0
[2025-11-13 10:15:51,256][__main__][INFO] - agents played in iteration 356 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:15:51,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:51,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:51,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:51,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:51,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:51,804][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:52,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:53,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:54,477][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:55,128][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:55,453][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:55,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:56,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:56,432][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:58,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:58,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:59,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:59,365][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:59,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:00,991][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:01,316][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:01,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:01,967][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:02,292][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:02,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:03,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:04,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:04,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:04,339][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:05,339][__main__][INFO] - Iteration 357 took 23s (39.57% Gen, 56.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 7s. Estimated total time: 19h 25m 23s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 13s.
[2025-11-13 10:16:05,341][__main__][INFO] - Starting iteration 357.
[2025-11-13 10:16:05,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:05,345][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:16:14,642][__main__][INFO] - Number of regex retries in iteration 357: 0 [2025-11-13 10:16:14,643][__main__][INFO] - agents played in iteration 357 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:16:15,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:15,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:15,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:15,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:16:15,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:16:15,185][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:16:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:16,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:17,853][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:18,178][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:18,505][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:18,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:19,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:19,483][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:19,809][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:20,134][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:20,459][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:20,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:21,116][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:21,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:21,770][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:22,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:22,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:23,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:23,409][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:23,732][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:24,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:24,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:24,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:25,030][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:25,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:26,002][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:26,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:27,006][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:27,732][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:27,734][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:27,736][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:28,724][__main__][INFO] - Iteration 358 took 23s (39.76% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 22s. Estimated total time: 19h 29m 2s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 50s.
[2025-11-13 10:16:28,726][__main__][INFO] - Starting iteration 358.
[2025-11-13 10:16:28,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:28,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:16:37,773][__main__][INFO] - Number of regex retries in iteration 358: 0
[2025-11-13 10:16:37,774][__main__][INFO] - agents played in iteration 358 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:16:38,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:38,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:38,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:38,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:38,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:16:38,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:16:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:39,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:39,686][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:40,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:40,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:42,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:42,618][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:42,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:43,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:44,243][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:44,899][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:45,225][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:45,553][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:45,880][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:46,861][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:47,187][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:47,513][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:48,165][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:48,491][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:48,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:49,146][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:49,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:50,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:50,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:50,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:50,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:51,837][__main__][INFO] - Iteration 359 took 23s (39.13% Gen, 56.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 23s. Estimated total time: 19h 15m 26s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 34s.
[2025-11-13 10:16:51,840][__main__][INFO] - Starting iteration 359.
[2025-11-13 10:16:51,843][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:51,844][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:01,163][__main__][INFO] - Number of regex retries in iteration 359: 0
[2025-11-13 10:17:01,164][__main__][INFO] - agents played in iteration 359 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:17:01,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:01,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:01,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:01,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:01,704][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:01,704][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:02,742][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:03,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:03,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:03,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:04,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:04,369][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:04,694][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:05,020][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:05,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:05,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:06,325][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:06,975][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:08,282][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:08,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:09,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:10,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:12,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:12,536][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:12,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:13,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:14,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:14,291][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:14,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:15,250][__main__][INFO] - Iteration 360 took 23s (39.82% Gen, 56.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 57s. Estimated total time: 19h 30m 23s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 3s.
[2025-11-13 10:17:15,253][__main__][INFO] - Starting iteration 360.
[2025-11-13 10:17:15,256][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:17:15,257][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:24,332][__main__][INFO] - Number of regex retries in iteration 360: 0
[2025-11-13 10:17:24,332][__main__][INFO] - agents played in iteration 360 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:17:24,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:24,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:24,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:24,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:24,876][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:24,876][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:25,916][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:27,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:27,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:27,871][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:28,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:28,522][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:28,847][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:29,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:29,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:29,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:30,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:30,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:31,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:32,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:32,440][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:33,417][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:33,745][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:34,394][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:35,047][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:35,373][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:35,697][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:36,025][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:36,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:37,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:37,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:37,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:39,346][__main__][INFO] - Iteration 361 took 24s (37.67% Gen, 54.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 41s. Estimated total time: 20h 4m 31s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 45s.
[2025-11-13 10:17:39,348][__main__][INFO] - Starting iteration 361.
[2025-11-13 10:17:39,351][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:17:39,352][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:49,224][__main__][INFO] - Number of regex retries in iteration 361: 0
[2025-11-13 10:17:49,224][__main__][INFO] - agents played in iteration 361 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:17:49,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:49,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:49,774][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:50,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:50,814][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:51,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:51,467][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:52,118][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:52,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:53,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:54,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:55,052][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:55,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:55,702][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:56,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:56,676][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:57,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:57,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:57,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:58,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:58,630][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:59,279][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:59,929][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:00,253][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:00,577][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:00,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:01,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:02,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:02,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:02,341][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:03,307][__main__][INFO] - Iteration 362 took 23s (41.21% Gen, 54.75% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 36s. Estimated total time: 19h 57m 50s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 55s, 500 more iterations: 3h 19m 38s.
[2025-11-13 10:18:03,309][__main__][INFO] - Starting iteration 362.
[2025-11-13 10:18:03,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:03,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:18:12,683][__main__][INFO] - Number of regex retries in iteration 362: 0
[2025-11-13 10:18:12,683][__main__][INFO] - agents played in iteration 362 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:18:13,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:13,241][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:18:13,241][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:18:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:18:14,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:18:14,604][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:18:14,929][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:18:15,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:18:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:18:15,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:18:16,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:18:16,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:18:16,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:18:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:18:17,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:18:17,861][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:18:18,188][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:18:18,519][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:18:18,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:18:19,173][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:18:19,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:18:19,831][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:18:20,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:18:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:18:21,140][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:18:21,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:18:21,801][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:18:22,130][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:18:22,461][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:18:22,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:18:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:18:23,443][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:18:23,768][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:18:24,092][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:18:24,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:18:25,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:18:25,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:18:25,835][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:18:25,837][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:18:26,858][__main__][INFO] - Iteration 363 took 23s (39.79% Gen, 55.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 41s. Estimated total time: 19h 37m 18s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 13s. [2025-11-13 10:18:26,860][__main__][INFO] - Starting iteration 363. [2025-11-13 10:18:26,863][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:18:26,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:18:35,885][__main__][INFO] - Number of regex retries in iteration 363: 0 [2025-11-13 10:18:35,885][__main__][INFO] - agents played in iteration 363 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:18:36,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:36,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:36,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:36,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:36,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:18:36,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:18:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:18:37,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:18:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:18:38,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:18:38,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:18:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:18:39,107][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:18:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:18:39,759][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:18:40,086][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:18:40,412][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:18:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:18:41,063][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:18:41,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:18:41,714][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:18:42,039][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:18:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:18:42,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:18:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:18:43,341][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:18:43,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:43,996][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:18:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:18:44,652][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:18:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:18:45,307][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:18:45,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:18:45,966][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:18:46,290][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:18:46,615][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:18:46,939][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:18:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:18:47,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:18:48,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:18:48,980][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:18:48,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:18:48,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:18:49,953][__main__][INFO] - Iteration 364 took 23s (39.07% Gen, 56.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 32s. Estimated total time: 19h 14m 33s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 25s. [2025-11-13 10:18:49,956][__main__][INFO] - Starting iteration 364. [2025-11-13 10:18:49,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:18:49,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:18:59,290][__main__][INFO] - Number of regex retries in iteration 364: 0 [2025-11-13 10:18:59,290][__main__][INFO] - agents played in iteration 364 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:18:59,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:59,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:18:59,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:19:00,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:01,217][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:01,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:02,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:02,520][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:03,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:03,821][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:04,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:04,474][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:04,799][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:05,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:05,454][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:05,782][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:06,108][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:06,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:06,766][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:07,416][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:07,745][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:10,682][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:11,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:19:11,689][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:19:12,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:19:12,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:19:12,407][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:19:13,376][__main__][INFO] - Iteration 365 took 23s (39.85% Gen, 56.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 15m 29s. Estimated total time: 19h 30m 53s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 8s. [2025-11-13 10:19:13,378][__main__][INFO] - Starting iteration 365. [2025-11-13 10:19:13,382][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:19:13,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:19:22,532][__main__][INFO] - Number of regex retries in iteration 365: 0 [2025-11-13 10:19:22,532][__main__][INFO] - agents played in iteration 365 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:19:22,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:23,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:23,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:23,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:23,084][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:19:23,084][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:19:23,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:24,132][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:24,458][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:24,784][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:25,110][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:25,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:25,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:26,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:26,413][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:26,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:27,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:28,046][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:28,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:29,347][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:29,673][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:29,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:30,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:30,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:30,977][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:31,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:31,628][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:31,955][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:32,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:32,607][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:32,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:33,257][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:33,582][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:33,907][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:34,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:19:34,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:19:35,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:19:35,626][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:19:35,627][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:19:36,599][__main__][INFO] - Iteration 366 took 23s (39.39% Gen, 56.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 5m 7s. Estimated total time: 19h 20m 55s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 29s. [2025-11-13 10:19:36,602][__main__][INFO] - Starting iteration 366. [2025-11-13 10:19:36,605][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:19:36,606][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:19:45,670][__main__][INFO] - Number of regex retries in iteration 366: 0 [2025-11-13 10:19:45,670][__main__][INFO] - agents played in iteration 366 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:19:46,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:46,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:46,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:46,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:46,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:19:46,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:19:46,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:47,243][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:47,569][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:47,895][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:48,221][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:48,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:48,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:49,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:49,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:49,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:50,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:51,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:51,802][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:52,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:52,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:53,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:53,761][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:54,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:54,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:55,404][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:55,732][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:56,391][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:56,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:57,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:19:58,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:19:58,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:19:58,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:19:58,809][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:19:59,806][__main__][INFO] - Iteration 367 took 23s (39.07% Gen, 56.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 54s. Estimated total time: 19h 20m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 20s. [2025-11-13 10:19:59,808][__main__][INFO] - Starting iteration 367. [2025-11-13 10:19:59,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:19:59,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:20:09,213][__main__][INFO] - Number of regex retries in iteration 367: 0 [2025-11-13 10:20:09,214][__main__][INFO] - agents played in iteration 367 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:20:09,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:09,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:09,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:09,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:09,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:20:09,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:20:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:20:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:20:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:20:11,446][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:20:11,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:20:12,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:20:12,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:20:12,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:20:13,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:20:13,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:20:13,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:20:14,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:20:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:20:14,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:20:15,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:20:15,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:20:15,672][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:20:15,997][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:20:16,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:20:16,648][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:20:16,974][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:17,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:20:17,625][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:20:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:20:18,284][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:20:18,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:20:18,936][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:20:19,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:20:19,593][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:20:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:20:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:20:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:20:20,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:20:21,587][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:20:22,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:20:22,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:20:22,296][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:20:23,267][__main__][INFO] - Iteration 368 took 23s (40.08% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 12s. Estimated total time: 19h 32m 46s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 27s. [2025-11-13 10:20:23,269][__main__][INFO] - Starting iteration 368. [2025-11-13 10:20:23,272][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:20:23,273][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:20:32,100][__main__][INFO] - Number of regex retries in iteration 368: 0
[2025-11-13 10:20:32,101][__main__][INFO] - agents played in iteration 368 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:20:32,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:32,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:32,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:32,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:32,653][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:20:32,653][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:20:33,391][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:20:33,688][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:20:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:20:34,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:20:34,667][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:20:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:20:35,319][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:20:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:20:35,970][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:20:36,296][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:20:36,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:20:36,947][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:20:37,273][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:20:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:20:37,924][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:20:38,251][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:20:38,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:20:38,900][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:20:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:20:39,551][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:20:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:40,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:20:40,526][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:20:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:20:41,177][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:20:41,503][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:20:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:20:42,154][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:20:42,485][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:20:42,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:20:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:20:43,462][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:20:43,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:44,476][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:45,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:45,208][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:45,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:46,197][__main__][INFO] - Iteration 369 took 22s (38.51% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 49m 20s. Estimated total time: 19h 6m 17s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 2s.
[2025-11-13 10:20:46,199][__main__][INFO] - Starting iteration 369.
[2025-11-13 10:20:46,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:46,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:20:54,943][__main__][INFO] - Number of regex retries in iteration 369: 0
[2025-11-13 10:20:54,944][__main__][INFO] - agents played in iteration 369 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:20:55,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:55,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:55,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:55,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:55,503][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:20:55,504][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:20:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:20:56,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:20:56,865][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:20:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:20:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:20:57,841][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:20:58,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:20:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:20:58,819][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:20:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:20:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:20:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:00,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:00,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:00,772][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:01,098][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:01,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:01,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:02,078][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:02,405][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:03,058][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:03,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:04,369][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:05,026][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:05,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:06,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:06,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:07,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:08,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:08,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:08,094][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:09,069][__main__][INFO] - Iteration 370 took 22s (38.22% Gen, 57.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 46m 0s. Estimated total time: 19h 3m 20s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 33s.
[2025-11-13 10:21:09,071][__main__][INFO] - Starting iteration 370.
[2025-11-13 10:21:09,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:21:09,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:18,479][__main__][INFO] - Number of regex retries in iteration 370: 0
[2025-11-13 10:21:18,479][__main__][INFO] - agents played in iteration 370 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:21:18,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:18,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:19,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:19,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:19,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:19,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:19,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:20,083][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:20,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:21,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:21,384][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:21,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:22,362][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:22,687][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:23,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:23,339][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:23,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:23,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:24,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:24,644][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:24,972][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:25,297][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:25,622][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:25,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:26,277][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:26,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:28,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:28,566][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:28,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:29,220][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:29,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:29,871][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:30,197][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:30,896][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:31,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:31,615][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:31,617][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:33,592][__main__][INFO] - Iteration 371 took 24s (38.35% Gen, 53.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 10s. Estimated total time: 20h 25m 54s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 51s, 500 more iterations: 3h 24m 19s.
[2025-11-13 10:21:33,595][__main__][INFO] - Starting iteration 371.
[2025-11-13 10:21:33,600][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:21:33,601][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:43,073][__main__][INFO] - Number of regex retries in iteration 371: 0
[2025-11-13 10:21:43,074][__main__][INFO] - agents played in iteration 371 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:21:43,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:43,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:43,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:43,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:43,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:43,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:44,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:44,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:45,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:45,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:45,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:46,620][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:46,946][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:47,271][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:48,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:49,222][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:49,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:49,873][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:50,530][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:51,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:51,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:52,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:52,485][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:53,136][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:53,461][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:53,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:54,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:55,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:56,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:56,181][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:56,183][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:57,163][__main__][INFO] - Iteration 372 took 23s (40.20% Gen, 55.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 4s. Estimated total time: 19h 38m 12s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 22s.
[2025-11-13 10:21:57,165][__main__][INFO] - Starting iteration 372.
[2025-11-13 10:21:57,168][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:21:57,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:06,745][__main__][INFO] - Number of regex retries in iteration 372: 0
[2025-11-13 10:22:06,746][__main__][INFO] - agents played in iteration 372 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:22:07,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:07,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:07,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:07,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:07,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:07,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:08,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:08,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:09,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:09,970][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:10,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:11,272][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:11,599][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:11,925][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:12,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:12,904][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:13,230][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:13,882][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:14,209][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:14,534][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:14,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:15,512][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:16,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:16,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:16,818][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:17,148][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:17,476][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:17,802][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:18,127][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:18,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:19,125][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:19,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:19,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:19,831][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:20,817][__main__][INFO] - Iteration 373 took 23s (40.49% Gen, 55.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 59s. Estimated total time: 19h 42m 30s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 5s.
[2025-11-13 10:22:20,819][__main__][INFO] - Starting iteration 373.
[2025-11-13 10:22:20,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:20,823][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:26,363][mllm.models.large_language_model_local][WARNING] - Response } did not match regex: (|), retry 1/1
[2025-11-13 10:22:30,328][__main__][INFO] - Number of regex retries in iteration 373: 1
[2025-11-13 10:22:30,328][__main__][INFO] - agents played in iteration 373 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:22:30,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:30,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:30,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:30,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:30,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:30,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:31,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:32,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:32,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:32,902][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:33,228][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:33,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:33,878][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:34,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:34,529][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:34,855][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:35,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:35,831][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:36,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:36,807][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:37,133][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:37,788][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:38,118][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:38,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:39,095][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:39,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:39,747][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:41,056][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:41,710][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:42,035][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:42,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:22:43,427][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:22:43,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:22:43,430][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:22:44,381][__main__][INFO] - Iteration 374 took 23s (40.35% Gen, 55.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 2s. Estimated total time: 19h 37m 58s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 19s. [2025-11-13 10:22:44,383][__main__][INFO] - Starting iteration 374. [2025-11-13 10:22:44,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 10:22:44,387][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:22:52,696][__main__][INFO] - Number of regex retries in iteration 374: 0 [2025-11-13 10:22:52,697][__main__][INFO] - agents played in iteration 374 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:22:53,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:53,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:53,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:53,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:22:53,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:22:53,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:22:53,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:22:54,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:22:54,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:22:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:22:55,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:22:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:22:55,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:22:56,241][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:22:56,566][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:22:56,892][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:22:57,216][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:22:57,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:22:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:22:58,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:22:58,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:22:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:22:59,167][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:22:59,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:22:59,817][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:23:00,143][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:23:00,468][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:23:00,795][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:23:01,120][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:23:01,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:23:01,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:23:02,101][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:23:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:23:02,758][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:23:03,085][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:23:03,410][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:23:03,735][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:23:04,062][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:23:04,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:23:05,114][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:23:05,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:23:05,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:23:05,852][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:23:06,858][__main__][INFO] - Iteration 375 took 22s (36.98% Gen, 58.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 24m 18s. Estimated total time: 18h 43m 36s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 27s, 500 more iterations: 3h 7m 16s. [2025-11-13 10:23:06,860][__main__][INFO] - Starting iteration 375. [2025-11-13 10:23:06,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 10:23:06,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:23:16,218][__main__][INFO] - Number of regex retries in iteration 375: 0 [2025-11-13 10:23:16,218][__main__][INFO] - agents played in iteration 375 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:23:16,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:16,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:16,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:16,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:16,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:23:16,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:23:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:23:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:23:18,125][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:23:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:23:18,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:23:19,100][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:23:19,426][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:23:19,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:23:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:23:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:23:20,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:23:21,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:23:21,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:23:21,703][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:23:22,029][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:23:22,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:23:22,680][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:23:23,005][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:23:23,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:23:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:23:23,981][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:23:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:23:24,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:23:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:23:25,286][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:23:25,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:23:25,938][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:23:26,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:23:26,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:23:26,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:23:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:23:27,568][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:23:27,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:23:28,614][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:23:29,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:23:29,331][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:23:29,332][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:23:30,310][__main__][INFO] - Iteration 376 took 23s (39.89% Gen, 55.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 40s. Estimated total time: 19h 32m 21s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 23s. [2025-11-13 10:23:30,312][__main__][INFO] - Starting iteration 376. [2025-11-13 10:23:30,315][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 10:23:30,316][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:23:39,893][__main__][INFO] - Number of regex retries in iteration 376: 0 [2025-11-13 10:23:39,894][__main__][INFO] - agents played in iteration 376 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:23:40,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:40,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:40,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:40,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:23:40,440][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:23:40,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:23:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:23:41,480][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:23:41,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:23:42,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:23:42,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:23:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:23:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:23:43,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:23:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:23:44,090][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:23:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:23:44,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:23:45,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:23:45,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:23:45,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:23:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:23:46,368][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:23:46,694][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:23:47,020][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:23:47,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:23:47,673][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:23:47,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:23:48,322][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:23:48,648][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:23:48,974][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:23:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:23:49,622][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:23:49,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:23:50,272][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:23:50,597][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:23:50,924][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:23:51,250][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:23:51,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:23:52,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:23:53,033][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:23:53,034][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:23:53,036][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:23:54,216][__main__][INFO] - Iteration 377 took 23s (40.07% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 58s. Estimated total time: 19h 55m 3s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 10s. [2025-11-13 10:23:54,218][__main__][INFO] - Starting iteration 377. [2025-11-13 10:23:54,221][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 10:23:54,222][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:24:03,428][__main__][INFO] - Number of regex retries in iteration 377: 0 [2025-11-13 10:24:03,429][__main__][INFO] - agents played in iteration 377 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:24:03,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:03,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:03,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:03,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:03,981][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:24:03,982][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:24:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:24:05,025][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:24:05,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:24:05,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:24:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:24:06,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:24:06,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:24:06,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:24:07,305][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:24:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:24:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:24:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:24:08,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:24:08,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:24:09,258][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:24:09,582][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:24:09,908][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:24:10,234][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:24:10,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:24:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:24:11,211][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:24:11,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:24:11,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:24:12,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:24:12,512][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:24:12,835][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:24:13,161][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:24:13,484][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:24:13,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:24:14,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:24:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:24:14,784][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:24:15,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:24:15,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:24:16,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:24:16,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:24:16,547][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:24:17,512][__main__][INFO] - Iteration 378 took 23s (39.53% Gen, 56.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 4m 7s. Estimated total time: 19h 24m 35s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 5s. [2025-11-13 10:24:17,514][__main__][INFO] - Starting iteration 378. [2025-11-13 10:24:17,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 10:24:17,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:24:26,311][__main__][INFO] - Number of regex retries in iteration 378: 0 [2025-11-13 10:24:26,312][__main__][INFO] - agents played in iteration 378 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:24:26,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:26,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:26,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:26,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:26,856][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:24:26,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:24:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:24:27,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:24:28,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:24:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:24:28,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:24:29,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:24:29,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:24:29,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:24:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:24:30,509][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:24:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:24:31,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:24:31,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:24:31,810][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:24:32,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:24:32,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:24:32,784][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:24:33,110][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:24:33,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:24:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:24:34,086][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:24:34,416][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:24:34,745][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:24:35,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:24:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:24:35,721][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:24:36,051][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:24:36,377][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:24:36,706][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:24:37,030][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:24:37,359][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:24:37,683][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:24:38,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:24:38,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:24:39,598][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:24:39,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:24:39,602][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:24:40,626][__main__][INFO] - Iteration 379 took 23s (38.05% Gen, 57.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 54m 36s. Estimated total time: 19h 15m 27s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 34s. [2025-11-13 10:24:40,628][__main__][INFO] - Starting iteration 379. [2025-11-13 10:24:40,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 10:24:40,632][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:24:50,093][__main__][INFO] - Number of regex retries in iteration 379: 0 [2025-11-13 10:24:50,094][__main__][INFO] - agents played in iteration 379 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:24:50,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:50,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:50,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:50,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:24:50,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:24:50,638][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:24:51,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:24:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:24:52,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:24:52,325][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:24:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:24:52,976][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:24:53,301][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:24:53,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:24:53,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:24:54,275][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:24:54,599][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:24:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:24:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:24:55,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:24:55,899][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:24:56,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:24:56,550][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:24:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:24:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:24:57,522][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:24:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:24:58,497][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:24:58,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:24:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:24:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:24:59,801][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:25:00,126][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:25:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:25:00,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:25:01,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:25:01,435][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:25:01,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:02,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:25:03,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:25:03,194][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:25:03,196][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:25:04,180][__main__][INFO] - Iteration 380 took 23s (40.18% Gen, 55.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 13s. Estimated total time: 19h 37m 28s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 14s. [2025-11-13 10:25:04,183][__main__][INFO] - Starting iteration 380. [2025-11-13 10:25:04,186][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 10:25:04,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:25:13,928][__main__][INFO] - Number of regex retries in iteration 380: 0 [2025-11-13 10:25:13,928][__main__][INFO] - agents played in iteration 380 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:25:14,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:14,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:14,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:14,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:14,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:25:14,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:25:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:25:15,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:25:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:25:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:25:16,511][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:25:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:25:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:25:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:25:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:25:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:25:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:25:18,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:25:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:25:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:25:19,761][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:25:20,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:25:20,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:25:20,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:25:21,064][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:25:21,391][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:25:21,716][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:22,045][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:25:22,372][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:25:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:25:23,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:25:23,349][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:25:23,674][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:25:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:25:24,325][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:25:24,650][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:25:24,975][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:25:25,299][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:25:25,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:26,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:25:27,022][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:25:27,023][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:25:27,025][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:25:29,090][__main__][INFO] - Iteration 381 took 24s (39.12% Gen, 52.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 23m 36s. Estimated total time: 20h 45m 15s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 30s, 500 more iterations: 3h 27m 32s. [2025-11-13 10:25:29,093][__main__][INFO] - Starting iteration 381. [2025-11-13 10:25:29,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:25:29,097][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:25:38,770][__main__][INFO] - Number of regex retries in iteration 381: 0 [2025-11-13 10:25:38,771][__main__][INFO] - agents played in iteration 381 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:25:39,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:39,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:39,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:39,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:25:39,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:25:39,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:25:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:25:40,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:25:40,686][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:25:41,011][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:25:41,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:25:41,661][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:25:41,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:25:42,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:25:42,635][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:25:42,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:25:43,285][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:25:43,611][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:25:43,935][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:25:44,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:25:44,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:25:44,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:25:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:25:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:25:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:25:46,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:25:46,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:46,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:25:47,192][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:25:47,516][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:25:47,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:25:48,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:25:48,493][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:25:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:25:49,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:25:49,473][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:25:49,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:25:50,134][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:25:50,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:51,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:25:51,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:25:51,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:25:51,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:25:52,885][__main__][INFO] - Iteration 382 took 23s (40.66% Gen, 55.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 24s. Estimated total time: 19h 49m 28s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 14s. [2025-11-13 10:25:52,888][__main__][INFO] - Starting iteration 382. [2025-11-13 10:25:52,892][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:25:52,893][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:26:02,735][__main__][INFO] - Number of regex retries in iteration 382: 0 [2025-11-13 10:26:02,736][__main__][INFO] - agents played in iteration 382 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:26:03,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:03,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:03,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:03,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:03,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:26:03,288][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:26:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:26:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:26:04,660][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:26:04,984][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:26:05,310][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:26:05,635][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:26:05,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:26:06,286][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:26:06,611][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:26:06,937][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:26:07,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:26:07,587][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:26:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:26:08,238][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:26:08,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:26:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:26:09,213][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:26:09,538][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:26:09,865][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:26:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:26:10,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:26:11,170][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:26:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:26:11,821][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:26:12,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:26:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:26:12,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:26:13,132][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:26:13,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:26:13,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:26:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:26:14,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:15,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:26:15,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:26:15,846][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:26:15,848][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:26:16,817][__main__][INFO] - Iteration 383 took 23s (41.14% Gen, 54.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 50s. Estimated total time: 19h 56m 18s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 23s. [2025-11-13 10:26:16,819][__main__][INFO] - Starting iteration 383. [2025-11-13 10:26:16,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:26:16,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:26:26,219][__main__][INFO] - Number of regex retries in iteration 383: 0 [2025-11-13 10:26:26,220][__main__][INFO] - agents played in iteration 383 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:26:26,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:26,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:26,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:26,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:26,769][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:26:26,770][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:26:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:26:27,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:26:28,141][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:26:28,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:26:28,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:26:29,118][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:26:29,443][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:26:29,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:26:30,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:26:30,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:26:30,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:26:31,072][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:26:31,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:26:31,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:26:32,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:26:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:26:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:26:33,024][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:26:33,348][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:26:33,674][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:26:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:26:34,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:26:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:26:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:26:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:26:35,951][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:26:36,279][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:26:36,605][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:26:36,930][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:26:37,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:26:37,585][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:26:37,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:38,613][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:26:39,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:26:39,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:26:39,338][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:26:40,341][__main__][INFO] - Iteration 384 took 23s (39.95% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 4s. Estimated total time: 19h 35m 55s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 59s. [2025-11-13 10:26:40,343][__main__][INFO] - Starting iteration 384. [2025-11-13 10:26:40,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:26:40,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:26:49,910][__main__][INFO] - Number of regex retries in iteration 384: 0 [2025-11-13 10:26:49,911][__main__][INFO] - agents played in iteration 384 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:26:50,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:26:50,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:26:50,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:26:51,238][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:26:51,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:26:51,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:26:52,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:26:52,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:26:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:26:53,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:26:53,488][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:26:53,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:26:54,140][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:26:54,465][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:26:54,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:26:55,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:26:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:26:55,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:26:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:26:56,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:26:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:26:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:26:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:26:57,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:58,048][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:26:58,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:26:58,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:26:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:26:59,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:26:59,677][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:27:00,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:27:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:27:00,660][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:27:00,986][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:27:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:27:01,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:02,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:27:03,071][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:27:03,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:27:03,074][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:27:04,011][__main__][INFO] - Iteration 385 took 23s (40.41% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 2s. Estimated total time: 19h 43m 17s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 12s.
[2025-11-13 10:27:04,013][__main__][INFO] - Starting iteration 385.
[2025-11-13 10:27:04,016][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:27:04,017][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:27:13,289][__main__][INFO] - Number of regex retries in iteration 385: 0
[2025-11-13 10:27:13,290][__main__][INFO] - agents played in iteration 385 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:27:13,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:14,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:14,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:14,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:14,202][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:27:14,203][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:27:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:27:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:27:15,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:27:15,911][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:27:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:27:16,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:27:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:27:17,214][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:27:17,540][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:27:17,865][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:27:18,191][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:27:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:27:18,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:27:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:27:19,496][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:27:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:27:20,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:27:20,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:27:20,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:27:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:27:21,450][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:27:21,776][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:27:22,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:27:22,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:27:22,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:27:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:27:23,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:23,745][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:24,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:24,402][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:25,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:26,088][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:27:26,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:27:26,813][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:27:26,815][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:27:27,938][__main__][INFO] - Iteration 386 took 23s (38.76% Gen, 56.54% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 30s. Estimated total time: 19h 56m 9s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 21s.
[2025-11-13 10:27:27,940][__main__][INFO] - Starting iteration 386.
[2025-11-13 10:27:27,943][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:27:27,944][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:27:37,654][__main__][INFO] - Number of regex retries in iteration 386: 0
[2025-11-13 10:27:37,654][__main__][INFO] - agents played in iteration 386 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:27:38,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:38,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:38,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:38,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:38,203][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:27:38,203][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:27:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:27:39,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:27:39,556][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:27:39,882][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:27:40,207][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:27:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:27:40,859][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:27:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:27:41,510][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:27:41,835][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:27:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:27:42,488][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:27:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:27:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:27:43,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:27:43,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:27:44,119][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:27:44,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:27:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:27:45,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:27:45,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:27:45,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:27:46,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:27:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:27:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:27:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:27:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:47,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:48,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:48,378][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:48,704][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:49,030][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:49,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:50,044][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:27:50,731][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:27:50,733][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:27:50,735][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:27:51,971][__main__][INFO] - Iteration 387 took 24s (40.41% Gen, 54.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 24s. Estimated total time: 20h 1m 27s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 2s, 500 more iterations: 3h 20m 14s.
[2025-11-13 10:27:51,973][__main__][INFO] - Starting iteration 387.
[2025-11-13 10:27:51,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:27:51,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:28:00,985][__main__][INFO] - Number of regex retries in iteration 387: 0
[2025-11-13 10:28:00,985][__main__][INFO] - agents played in iteration 387 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:28:01,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:01,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:01,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:01,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:01,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:28:01,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:28:02,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:28:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:28:02,895][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:28:03,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:28:03,544][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:28:03,870][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:28:04,195][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:28:04,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:28:04,845][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:28:05,171][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:28:05,497][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:28:05,824][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:28:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:28:06,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:28:06,802][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:28:07,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:28:07,456][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:28:07,787][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:28:08,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:28:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:28:08,766][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:28:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:28:09,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:28:09,742][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:28:10,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:28:10,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:28:10,722][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:28:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:28:11,373][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:28:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:28:12,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:28:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:28:12,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:28:13,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:28:14,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:28:14,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:28:14,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:28:15,046][__main__][INFO] - Iteration 388 took 23s (39.05% Gen, 56.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 49m 6s. Estimated total time: 19h 13m 32s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 15s.
[2025-11-13 10:28:15,048][__main__][INFO] - Starting iteration 388.
[2025-11-13 10:28:15,051][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:28:15,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:28:24,096][__main__][INFO] - Number of regex retries in iteration 388: 0
[2025-11-13 10:28:24,096][__main__][INFO] - agents played in iteration 388 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:28:24,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:24,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:24,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:24,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:24,644][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:28:24,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:28:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:28:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:28:26,005][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:28:26,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:28:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:28:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:28:27,307][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:28:27,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:28:27,957][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:28:28,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:28:28,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:28:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:28:29,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:28:29,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:28:29,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:28:30,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:28:30,572][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:28:30,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:28:31,221][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:28:31,548][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:28:31,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:28:32,204][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:28:32,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:28:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:28:33,187][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:28:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:28:33,839][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:28:34,163][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:28:34,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:28:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:28:35,140][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:28:35,465][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:28:35,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:28:36,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:28:37,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:28:37,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:28:37,213][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:28:38,134][__main__][INFO] - Iteration 389 took 23s (39.18% Gen, 56.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 49m 23s. Estimated total time: 19h 14m 12s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 22s.
[2025-11-13 10:28:38,136][__main__][INFO] - Starting iteration 389.
[2025-11-13 10:28:38,139][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:28:38,140][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:28:46,678][__main__][INFO] - Number of regex retries in iteration 389: 0
[2025-11-13 10:28:46,678][__main__][INFO] - agents played in iteration 389 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:28:47,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:28:47,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:28:47,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:28:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:28:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:28:48,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:28:49,265][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:28:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:28:49,916][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:28:50,240][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:28:50,566][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:28:50,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:28:51,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:28:51,544][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:28:51,869][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:28:52,194][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:28:52,520][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:28:52,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:28:53,177][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:28:53,504][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:28:53,829][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:28:54,152][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:28:54,477][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:28:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:28:55,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:28:55,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:28:55,778][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:28:56,103][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:28:56,428][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:28:56,754][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:28:57,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:28:57,406][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:28:57,731][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:28:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:28:58,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:28:58,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:28:59,417][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:29:00,141][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:29:00,142][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:29:00,144][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:29:01,078][__main__][INFO] - Iteration 390 took 22s (37.22% Gen, 58.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 41m 47s. Estimated total time: 19h 6m 59s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 9s.
[2025-11-13 10:29:01,080][__main__][INFO] - Starting iteration 390.
[2025-11-13 10:29:01,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:29:01,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:10,357][__main__][INFO] - Number of regex retries in iteration 390: 0 [2025-11-13 10:29:10,358][__main__][INFO] - agents played in iteration 390 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:29:10,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:10,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:10,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:29:11,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:11,943][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:12,269][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:12,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:13,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:29:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:29:13,897][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:14,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:14,547][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:14,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:15,198][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:15,524][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:16,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:16,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:17,157][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:29:18,135][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:29:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:18,788][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:19,115][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:19,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:19,770][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:20,095][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:21,081][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:21,409][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:21,734][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:22,062][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:29:22,777][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:29:23,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:29:23,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:29:23,513][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:29:25,498][__main__][INFO] - Iteration 391 took 24s (37.98% Gen, 53.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 8s. Estimated total time: 20h 20m 44s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 41s, 500 more iterations: 3h 23m 27s.
[2025-11-13 10:29:25,500][__main__][INFO] - Starting iteration 391.
[2025-11-13 10:29:25,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:29:25,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:29:35,183][__main__][INFO] - Number of regex retries in iteration 391: 0
[2025-11-13 10:29:35,184][__main__][INFO] - agents played in iteration 391 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:29:35,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:35,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:35,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:35,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:35,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:29:35,733][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:29:36,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:29:36,761][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:29:37,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:29:37,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:29:37,738][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:29:38,062][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:29:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:29:38,713][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:29:39,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:29:39,363][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:29:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:29:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:29:40,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:29:40,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:29:40,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:29:41,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:29:41,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:29:41,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:29:42,292][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:29:42,621][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:29:42,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:29:43,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:29:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:29:43,935][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:29:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:29:44,594][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:29:44,921][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:29:45,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:29:45,576][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:29:45,903][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:29:46,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:29:46,557][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:29:46,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:29:47,589][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:29:48,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:29:48,313][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:29:48,315][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:29:49,296][__main__][INFO] - Iteration 392 took 23s (40.69% Gen, 55.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 45s. Estimated total time: 19h 49m 45s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 17s.
[2025-11-13 10:29:49,299][__main__][INFO] - Starting iteration 392.
[2025-11-13 10:29:49,302][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:29:49,303][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:29:58,259][__main__][INFO] - Number of regex retries in iteration 392: 0
[2025-11-13 10:29:58,260][__main__][INFO] - agents played in iteration 392 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:29:58,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:29:59,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:29:59,154][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:29:59,905][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:30:00,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:30:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:30:00,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:30:01,180][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:30:01,507][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:30:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:30:02,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:30:02,480][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:30:02,806][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:30:03,132][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:30:03,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:30:03,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:30:04,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:30:04,433][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:30:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:30:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:30:05,408][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:30:05,735][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:30:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:30:06,391][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:30:06,716][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:30:07,041][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:30:07,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:30:07,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:30:08,017][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:30:08,342][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:30:08,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:30:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:30:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:30:09,649][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:30:09,975][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:30:10,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:30:10,994][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:30:11,727][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:30:11,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:30:11,730][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:30:12,799][__main__][INFO] - Iteration 393 took 23s (38.12% Gen, 57.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 8m 28s. Estimated total time: 19h 34m 51s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 48s.
[2025-11-13 10:30:12,801][__main__][INFO] - Starting iteration 393.
[2025-11-13 10:30:12,804][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:30:12,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:30:22,040][__main__][INFO] - Number of regex retries in iteration 393: 0
[2025-11-13 10:30:22,040][__main__][INFO] - agents played in iteration 393 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:30:22,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:22,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:22,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:22,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:22,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:30:22,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:30:23,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:30:23,647][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:30:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:30:24,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:30:24,627][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:30:24,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:30:25,277][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:30:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:30:25,927][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:30:26,252][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:30:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:30:26,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:30:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:30:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:30:27,882][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:30:28,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:30:28,533][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:30:28,859][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:30:29,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:30:29,509][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:30:29,834][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:30:30,158][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:30:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:30:30,808][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:30:31,134][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:30:31,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:30:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:30:32,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:30:32,442][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:30:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:30:33,090][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:30:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:30:33,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:30:34,436][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:30:35,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:30:35,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:30:35,176][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:30:36,146][__main__][INFO] - Iteration 394 took 23s (39.56% Gen, 56.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 0m 21s. Estimated total time: 19h 27m 8s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 31s.
[2025-11-13 10:30:36,148][__main__][INFO] - Starting iteration 394.
[2025-11-13 10:30:36,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:30:36,153][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:30:45,647][__main__][INFO] - Number of regex retries in iteration 394: 0
[2025-11-13 10:30:45,648][__main__][INFO] - agents played in iteration 394 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:30:46,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:30:46,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:30:46,205][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:30:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:30:47,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:30:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:30:47,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:30:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:30:48,559][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:30:48,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:30:49,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:30:49,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:30:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:30:50,185][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:30:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:30:50,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:30:51,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:30:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:30:51,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:30:52,138][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:30:52,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:30:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:30:53,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:30:53,444][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:30:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:30:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:30:54,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:30:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:30:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:30:55,405][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:30:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:30:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:30:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:30:56,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:30:57,036][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:30:57,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:30:58,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:30:58,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:30:58,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:30:58,763][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:30:59,721][__main__][INFO] - Iteration 395 took 23s (40.28% Gen, 55.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 19s. Estimated total time: 19h 38m 29s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 24s.
[2025-11-13 10:30:59,723][__main__][INFO] - Starting iteration 395.
[2025-11-13 10:30:59,726][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:30:59,727][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:31:08,938][__main__][INFO] - Number of regex retries in iteration 395: 0
[2025-11-13 10:31:08,939][__main__][INFO] - agents played in iteration 395 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:31:09,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:09,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:09,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:09,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:31:09,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:31:09,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:31:10,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:31:10,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:31:10,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:31:11,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:31:11,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:31:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:31:12,152][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:31:12,478][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:31:12,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:31:13,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:31:13,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:31:13,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:31:14,106][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:31:14,432][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:31:14,759][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:31:15,084][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:31:15,408][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:31:15,732][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:31:16,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:31:16,383][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:31:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:31:17,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:31:17,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:31:17,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:31:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:31:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:31:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:31:18,989][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:31:19,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:31:19,639][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:31:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:31:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:31:20,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:31:21,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:22,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:22,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:22,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:23,020][__main__][INFO] - Iteration 396 took 23s (39.55% Gen, 56.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 57m 9s. Estimated total time: 19h 24m 43s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 7s. [2025-11-13 10:31:23,022][__main__][INFO] - Starting iteration 396. [2025-11-13 10:31:23,025][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:31:23,025][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:32,392][__main__][INFO] - Number of regex retries in iteration 396: 0 [2025-11-13 10:31:32,392][__main__][INFO] - agents played in iteration 396 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:31:32,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:32,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:32,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:32,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:32,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:32,947][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:31:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:31:33,984][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:31:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:31:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:31:34,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:31:35,288][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:31:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:31:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:31:36,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:31:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:31:36,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:31:37,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:31:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:31:37,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:31:38,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:31:38,542][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:31:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:31:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:31:39,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:31:39,844][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:31:40,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:31:40,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:31:40,818][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:31:41,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:31:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:31:41,799][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:31:42,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:31:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:31:42,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:31:43,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:31:43,427][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:31:43,756][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:31:44,079][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:31:44,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:45,478][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:45,479][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:45,481][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:46,709][__main__][INFO] - Iteration 397 took 23s (39.55% Gen, 55.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 16s. Estimated total time: 19h 44m 14s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 22s. [2025-11-13 10:31:46,711][__main__][INFO] - Starting iteration 397. [2025-11-13 10:31:46,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:31:46,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:56,532][__main__][INFO] - Number of regex retries in iteration 397: 0 [2025-11-13 10:31:56,533][__main__][INFO] - agents played in iteration 397 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:31:56,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:57,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:57,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:57,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:57,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:57,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:31:57,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:31:58,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:31:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:31:58,788][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:31:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:31:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:31:59,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:00,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:00,415][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:00,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:01,393][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:01,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:02,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:02,368][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:02,693][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:03,019][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:03,343][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:03,993][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:04,318][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:04,643][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:05,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:05,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:06,269][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:06,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:07,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:08,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:08,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:09,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:09,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:09,683][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:10,710][__main__][INFO] - Iteration 398 took 23s (40.91% Gen, 54.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 27s. Estimated total time: 19h 59m 48s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 58s. [2025-11-13 10:32:10,712][__main__][INFO] - Starting iteration 398. [2025-11-13 10:32:10,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:32:10,717][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:32:20,225][__main__][INFO] - Number of regex retries in iteration 398: 0 [2025-11-13 10:32:20,226][__main__][INFO] - agents played in iteration 398 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:32:20,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:20,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:20,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:20,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:20,789][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:32:20,789][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:32:21,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:32:21,834][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:32:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:32:22,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:32:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:32:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:32:23,461][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:23,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:24,112][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:24,437][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:25,090][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:25,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:26,391][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:26,720][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:27,050][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:27,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:28,357][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:29,007][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:29,336][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:29,665][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:29,989][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:30,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:30,639][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:30,965][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:31,296][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:31,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:32,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:33,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:33,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:33,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:34,390][__main__][INFO] - Iteration 399 took 23s (40.16% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 59s. Estimated total time: 19h 43m 44s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 17s. [2025-11-13 10:32:34,392][__main__][INFO] - Starting iteration 399. [2025-11-13 10:32:34,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:32:34,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:32:44,070][__main__][INFO] - Number of regex retries in iteration 399: 0 [2025-11-13 10:32:44,070][__main__][INFO] - agents played in iteration 399 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:32:44,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:44,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:44,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:44,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:44,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:32:44,622][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:32:45,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:32:45,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:32:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:32:46,317][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:32:46,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:32:46,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:32:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:47,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:47,944][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:48,270][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:48,596][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:48,921][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:49,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:49,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:50,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:51,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:52,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:53,484][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:53,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:54,135][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:54,461][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:54,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:55,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:55,764][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:56,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:57,215][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:57,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:57,219][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:58,270][__main__][INFO] - Iteration 400 took 23s (40.51% Gen, 55.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 31s. Estimated total time: 19h 53m 40s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 56s. [2025-11-13 10:32:58,272][__main__][INFO] - Starting iteration 400. [2025-11-13 10:32:58,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:32:58,275][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:33:07,555][__main__][INFO] - Number of regex retries in iteration 400: 0 [2025-11-13 10:33:07,556][__main__][INFO] - agents played in iteration 400 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:33:08,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:08,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:08,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:08,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:08,107][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:33:08,107][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:33:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:09,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:09,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:09,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:10,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:10,771][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:12,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:33:12,722][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:33:13,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:33:13,372][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:33:13,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:33:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:33:14,351][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:14,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:15,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:15,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:16,633][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:16,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:17,292][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:17,622][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:17,954][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:18,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:18,606][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:18,930][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:19,256][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:19,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:20,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:20,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:20,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:22,534][__main__][INFO] - Iteration 401 took 24s (38.26% Gen, 54.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 26s. Estimated total time: 20h 13m 0s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 26s, 500 more iterations: 3h 22m 10s.
[2025-11-13 10:33:22,536][__main__][INFO] - Starting iteration 401.
[2025-11-13 10:33:22,539][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:33:22,540][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:33:30,514][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . 
did not match regex: (|), retry 1/1
[2025-11-13 10:33:31,676][__main__][INFO] - Number of regex retries in iteration 401: 1
[2025-11-13 10:33:31,676][__main__][INFO] - agents played in iteration 401 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:33:32,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:32,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:32,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:32,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:32,225][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:33:32,225][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:33:32,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:33,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:33,926][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:34,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:34,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:35,227][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:35,552][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:35,876][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:36,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:36,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:33:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:33:37,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:33:37,499][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:33:37,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:33:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:33:38,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:38,805][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:39,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:39,453][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:39,779][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:40,433][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:40,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:41,083][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:41,408][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:41,734][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:42,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:42,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:43,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:43,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:44,089][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:44,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:44,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:44,824][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:45,755][__main__][INFO] - Iteration 402 took 23s (39.35% Gen, 56.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 50m 52s. Estimated total time: 19h 20m 48s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 28s.
[2025-11-13 10:33:45,757][__main__][INFO] - Starting iteration 402.
[2025-11-13 10:33:45,760][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:33:45,760][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:33:55,458][__main__][INFO] - Number of regex retries in iteration 402: 0
[2025-11-13 10:33:55,458][__main__][INFO] - agents played in iteration 402 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:33:55,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:55,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:55,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:56,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:56,016][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:33:56,017][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:33:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:57,058][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:57,384][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:57,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:58,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:58,687][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:59,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:59,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:59,662][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:00,312][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:00,637][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:00,961][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:01,287][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:01,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:02,265][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:02,590][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:03,573][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:03,897][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:04,223][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:04,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:04,876][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:05,203][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:05,530][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:05,856][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:06,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:07,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:07,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:08,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:08,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:08,611][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:09,529][__main__][INFO] - Iteration 403 took 23s (40.80% Gen, 55.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 10s. Estimated total time: 19h 48m 30s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 5s.
[2025-11-13 10:34:09,531][__main__][INFO] - Starting iteration 403.
[2025-11-13 10:34:09,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:09,535][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:18,824][__main__][INFO] - Number of regex retries in iteration 403: 0
[2025-11-13 10:34:18,825][__main__][INFO] - agents played in iteration 403 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:34:19,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:19,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:19,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:19,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:19,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:19,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:20,421][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:21,073][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:22,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:22,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:22,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:23,024][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:23,349][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:23,674][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:24,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:24,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:24,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:25,297][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:25,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:26,597][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:27,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:27,905][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:28,232][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:28,887][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:29,213][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:30,186][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:30,516][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:31,228][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:31,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:31,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:31,947][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:32,851][__main__][INFO] - Iteration 404 took 23s (39.84% Gen, 56.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 55m 10s. Estimated total time: 19h 25m 54s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 19s.
[2025-11-13 10:34:32,853][__main__][INFO] - Starting iteration 404.
[2025-11-13 10:34:32,856][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:32,857][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:42,575][__main__][INFO] - Number of regex retries in iteration 404: 0
[2025-11-13 10:34:42,576][__main__][INFO] - agents played in iteration 404 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:34:43,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:43,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:43,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:43,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:43,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:43,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:44,817][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:45,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:45,792][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:46,117][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:46,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:46,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:47,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:47,420][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:47,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:48,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:48,396][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:48,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:49,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:49,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:49,708][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:50,688][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:51,342][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:51,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:51,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:52,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:52,646][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:52,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:53,297][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:53,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:53,950][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:54,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:55,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:55,729][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:55,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:55,732][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:56,639][__main__][INFO] - Iteration 405 took 23s (40.86% Gen, 55.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 4s. Estimated total time: 19h 49m 12s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 12s.
[2025-11-13 10:34:56,641][__main__][INFO] - Starting iteration 405.
[2025-11-13 10:34:56,644][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:56,645][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:06,130][__main__][INFO] - Number of regex retries in iteration 405: 0
[2025-11-13 10:35:06,130][__main__][INFO] - agents played in iteration 405 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:35:06,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:06,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:06,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:07,705][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:08,032][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:08,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:09,010][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:09,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:09,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:09,990][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:10,316][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:10,967][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:11,294][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:11,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:12,275][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:12,602][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:12,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:13,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:14,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:14,886][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:15,543][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:15,870][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:16,201][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:16,526][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:17,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:17,504][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:17,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:18,539][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:19,250][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:19,252][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:19,254][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:20,234][__main__][INFO] - Iteration 406 took 23s (40.21% Gen, 55.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 7m 59s. Estimated total time: 19h 39m 30s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 35s.
[2025-11-13 10:35:20,236][__main__][INFO] - Starting iteration 406.
[2025-11-13 10:35:20,239][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:20,240][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:29,047][__main__][INFO] - Number of regex retries in iteration 406: 0
[2025-11-13 10:35:29,048][__main__][INFO] - agents played in iteration 406 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:35:29,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:29,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:29,614][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:30,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:30,993][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:31,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:31,646][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:31,972][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:32,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:32,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:33,273][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:33,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:33,924][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:34,249][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:34,573][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:34,898][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:35,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:35,551][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:37,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:37,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:37,836][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:38,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:38,816][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:39,142][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:39,467][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:39,793][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:40,117][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:40,444][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:40,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:41,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:42,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:42,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:42,211][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:43,182][__main__][INFO] - Iteration 407 took 22s (38.39% Gen, 57.37% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 35m 16s. Estimated total time: 19h 7m 10s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 11s.
[2025-11-13 10:35:43,184][__main__][INFO] - Starting iteration 407.
[2025-11-13 10:35:43,187][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:43,188][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:52,404][__main__][INFO] - Number of regex retries in iteration 407: 0
[2025-11-13 10:35:52,404][__main__][INFO] - agents played in iteration 407 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:35:52,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:52,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:52,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:52,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:52,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:52,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:53,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:54,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:54,645][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:54,969][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:55,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:55,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:56,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:56,595][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:57,249][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:57,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:58,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:58,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:59,214][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:59,864][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:00,196][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:00,522][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:00,852][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:01,182][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:01,508][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:01,837][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:02,489][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:03,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:03,787][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:04,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:04,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:05,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:05,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:05,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:06,560][__main__][INFO] - Iteration 408 took 23s (39.43% Gen, 56.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 22s. Estimated total time: 19h 28m 40s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 46s.
[2025-11-13 10:36:06,562][__main__][INFO] - Starting iteration 408.
[2025-11-13 10:36:06,565][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:06,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:36:15,642][__main__][INFO] - Number of regex retries in iteration 408: 0
[2025-11-13 10:36:15,643][__main__][INFO] - agents played in iteration 408 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:36:16,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:16,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:36:16,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:36:17,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:36:17,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:36:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:36:18,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:36:18,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:36:18,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:36:19,211][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:36:19,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:36:19,862][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:36:20,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:36:20,514][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:36:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:36:21,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:36:21,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:36:21,818][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:36:22,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:36:22,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:36:22,793][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:36:23,118][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:23,444][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:23,770][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:24,423][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:24,749][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:25,075][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:25,401][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:25,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:26,051][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:26,701][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:27,025][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:27,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:28,392][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:29,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:29,119][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:29,121][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:30,079][__main__][INFO] - Iteration 409 took 23s (38.60% Gen, 57.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 3s. Estimated total time: 19h 35m 43s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 57s.
[2025-11-13 10:36:30,082][__main__][INFO] - Starting iteration 409.
[2025-11-13 10:36:30,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:30,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:36:39,442][__main__][INFO] - Number of regex retries in iteration 409: 0
[2025-11-13 10:36:39,442][__main__][INFO] - agents played in iteration 409 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:36:39,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:39,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:36:39,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:36:40,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:36:41,035][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:36:41,361][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:36:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:36:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:36:42,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:36:42,665][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:36:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:36:43,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:36:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:36:43,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:36:44,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:36:44,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:36:44,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:36:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:36:45,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:36:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:36:46,250][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:36:46,578][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:46,905][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:47,561][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:48,218][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:48,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:49,197][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:49,522][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:50,496][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:51,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:51,883][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:52,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:52,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:52,627][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:53,608][__main__][INFO] - Iteration 410 took 23s (39.77% Gen, 56.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 9s. Estimated total time: 19h 36m 14s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 2s.
[2025-11-13 10:36:53,610][__main__][INFO] - Starting iteration 410.
[2025-11-13 10:36:53,614][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:53,614][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:03,453][__main__][INFO] - Number of regex retries in iteration 410: 0
[2025-11-13 10:37:03,454][__main__][INFO] - agents played in iteration 410 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:37:03,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:03,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:03,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:04,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:04,007][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:04,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:05,377][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:06,026][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:06,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:06,678][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:07,330][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:07,986][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:08,318][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:08,643][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:08,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:09,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:09,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:09,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:10,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:10,917][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:11,241][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:11,565][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:37:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:37:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:37:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:37:13,191][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:37:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:37:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:37:14,165][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:37:14,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:37:14,816][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:37:15,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:37:15,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:37:16,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:37:16,762][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:37:16,764][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:37:18,548][__main__][INFO] - Iteration 411 took 24s (39.46% Gen, 53.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 16s. Estimated total time: 20h 46m 45s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 33s, 500 more iterations: 3h 27m 47s.
[2025-11-13 10:37:18,551][__main__][INFO] - Starting iteration 411.
[2025-11-13 10:37:18,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:37:18,554][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:27,527][__main__][INFO] - Number of regex retries in iteration 411: 0
[2025-11-13 10:37:27,528][__main__][INFO] - agents played in iteration 411 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:37:27,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:28,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:28,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:28,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:28,071][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:28,072][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:28,810][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:29,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:29,809][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:30,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:30,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:31,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:31,434][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:31,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:32,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:32,410][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:32,737][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:33,061][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:33,712][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:35,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:35,338][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:35,667][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:35,994][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:37:36,322][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:37:36,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:37:36,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:37:37,298][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:37:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:37:37,948][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:37:38,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:37:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:37:38,925][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:37:39,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:37:39,953][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:37:40,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:37:40,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:37:40,673][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:37:41,616][__main__][INFO] - Iteration 412 took 23s (38.90% Gen, 57.00% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 39m 18s. Estimated total time: 19h 13m 10s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 11s.
[2025-11-13 10:37:41,618][__main__][INFO] - Starting iteration 412.
[2025-11-13 10:37:41,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:37:41,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:51,498][__main__][INFO] - Number of regex retries in iteration 412: 0
[2025-11-13 10:37:51,498][__main__][INFO] - agents played in iteration 412 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:37:51,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:51,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:52,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:52,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:52,052][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:52,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:52,802][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:53,425][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:53,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:54,076][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:54,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:55,052][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:55,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:55,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:56,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:56,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:57,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:57,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:57,657][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:57,982][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:58,307][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:58,632][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:58,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:59,615][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:59,941][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:38:00,267][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:38:00,592][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:38:00,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:38:01,248][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:38:01,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:38:01,904][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:38:02,231][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:38:02,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:38:02,883][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:03,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:03,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:04,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:04,672][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:04,674][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:05,597][__main__][INFO] - Iteration 413 took 23s (41.19% Gen, 54.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 31s. Estimated total time: 19h 58m 47s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 47s.
[2025-11-13 10:38:05,600][__main__][INFO] - Starting iteration 413.
[2025-11-13 10:38:05,602][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:05,602][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:38:14,924][__main__][INFO] - Number of regex retries in iteration 413: 0
[2025-11-13 10:38:14,924][__main__][INFO] - agents played in iteration 413 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:38:15,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:15,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:15,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:15,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:15,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:38:15,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:38:16,211][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:38:16,508][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:38:16,833][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:38:17,158][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:38:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:38:17,807][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:38:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:38:18,457][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:38:18,782][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:38:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:38:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:38:19,757][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:38:20,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:38:20,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:38:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:38:21,058][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:38:21,387][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:38:21,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:38:22,042][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:38:22,369][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:38:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:38:23,026][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:38:23,354][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:38:23,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:38:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:38:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:38:24,672][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:38:24,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:38:25,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:38:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:38:25,972][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:38:26,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:26,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:27,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:28,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:28,077][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:28,079][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:29,159][__main__][INFO] - Iteration 414 took 23s (39.57% Gen, 55.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 12s. Estimated total time: 19h 37m 52s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 18s.
[2025-11-13 10:38:29,161][__main__][INFO] - Starting iteration 414.
[2025-11-13 10:38:29,164][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:29,165][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:38:38,760][__main__][INFO] - Number of regex retries in iteration 414: 0
[2025-11-13 10:38:38,761][__main__][INFO] - agents played in iteration 414 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:38:39,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:39,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:39,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:39,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:39,312][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:38:39,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:38:40,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:38:40,366][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:38:40,694][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:38:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:38:41,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:38:41,672][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:38:41,997][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:38:42,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:38:42,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:38:42,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:38:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:38:43,626][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:38:43,953][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:38:44,280][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:38:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:38:44,930][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:38:45,259][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:38:45,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:38:45,911][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:38:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:38:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:38:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:38:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:38:47,539][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:38:47,866][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:38:48,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:38:48,522][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:38:48,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:38:49,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:38:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:38:49,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:38:50,145][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:50,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:51,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:51,936][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:51,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:51,939][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:52,913][__main__][INFO] - Iteration 415 took 23s (40.40% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 26s. Estimated total time: 19h 47m 30s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 55s.
[2025-11-13 10:38:52,916][__main__][INFO] - Starting iteration 415.
[2025-11-13 10:38:52,919][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:52,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:01,802][__main__][INFO] - Number of regex retries in iteration 415: 0
[2025-11-13 10:39:01,803][__main__][INFO] - agents played in iteration 415 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:39:02,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:02,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:02,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:02,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:02,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:02,703][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:39:03,763][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:39:04,089][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:39:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:39:04,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:39:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:39:05,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:39:05,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:39:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:39:06,365][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:39:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:39:07,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:39:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:39:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:39:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:39:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:39:08,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:39:08,963][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:39:09,287][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:39:09,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:39:09,939][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:39:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:39:10,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:39:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:39:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:39:11,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:39:11,918][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:39:12,246][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:39:12,575][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:39:12,903][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:39:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:39:13,558][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:39:13,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:39:14,612][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:15,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:15,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:15,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:16,523][__main__][INFO] - Iteration 416 took 23s (37.63% Gen, 57.41% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 4m 48s. Estimated total time: 19h 40m 15s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 42s.
[2025-11-13 10:39:16,525][__main__][INFO] - Starting iteration 416.
[2025-11-13 10:39:16,528][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:16,529][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:26,192][__main__][INFO] - Number of regex retries in iteration 416: 0
[2025-11-13 10:39:26,193][__main__][INFO] - agents played in iteration 416 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:39:26,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:26,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:26,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:26,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:26,741][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:26,742][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:27,489][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:39:27,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:39:28,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:39:28,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:39:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:39:29,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:39:29,414][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:39:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:39:30,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:39:30,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:39:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:39:31,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:39:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:39:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:39:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:39:32,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:39:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:39:32,999][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:39:33,327][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:39:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:39:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:39:34,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:39:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:39:34,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:39:35,282][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:39:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:39:35,932][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:39:36,256][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:39:36,580][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:39:36,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:39:37,232][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:39:37,557][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:39:37,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:39:38,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:39,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:39,321][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:39,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:40,298][__main__][INFO] - Iteration 417 took 23s (40.65% Gen, 55.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 40s. Estimated total time: 19h 48m 31s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 5s.
[2025-11-13 10:39:40,300][__main__][INFO] - Starting iteration 417.
[2025-11-13 10:39:40,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:40,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:49,241][__main__][INFO] - Number of regex retries in iteration 417: 0
[2025-11-13 10:39:49,241][__main__][INFO] - agents played in iteration 417 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:39:49,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:49,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:49,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:49,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:49,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:49,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:50,547][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:39:50,843][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:39:51,169][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:39:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:39:51,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:39:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:39:52,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:39:52,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:39:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:39:53,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:39:53,777][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:39:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:39:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:39:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:39:55,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:39:55,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:39:55,732][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:39:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:39:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:39:56,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:39:57,036][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:39:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:39:57,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:39:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:39:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:39:58,669][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:39:58,995][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:39:59,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:39:59,644][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:39:59,969][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:00,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:00,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:01,659][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:02,372][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:02,373][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:02,375][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:03,585][__main__][INFO] - Iteration 418 took 23s (38.39% Gen, 56.41% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 47m 50s. Estimated total time: 19h 24m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 0s.
[2025-11-13 10:40:03,587][__main__][INFO] - Starting iteration 418.
[2025-11-13 10:40:03,590][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:03,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:40:13,078][__main__][INFO] - Number of regex retries in iteration 418: 0
[2025-11-13 10:40:13,078][__main__][INFO] - agents played in iteration 418 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:40:13,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:13,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:13,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:13,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:13,626][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:40:13,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:40:14,378][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:40:14,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:40:15,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:40:15,326][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:40:15,652][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:40:15,976][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:40:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:40:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:40:16,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:40:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:40:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:40:17,925][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:40:18,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:40:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:40:18,902][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:40:19,227][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:40:19,553][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:40:19,879][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:40:20,206][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:40:20,532][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:40:20,857][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:40:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:40:21,513][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:40:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:40:22,171][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:40:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:40:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:40:23,150][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:40:23,477][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:40:23,802][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:24,128][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:24,453][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:24,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:25,488][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:26,226][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:26,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:26,229][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:27,284][__main__][INFO] - Iteration 419 took 23s (40.04% Gen, 55.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 8m 8s. Estimated total time: 19h 44m 46s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 27s.
[2025-11-13 10:40:27,287][__main__][INFO] - Starting iteration 419.
[2025-11-13 10:40:27,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:27,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:40:36,926][__main__][INFO] - Number of regex retries in iteration 419: 0
[2025-11-13 10:40:36,927][__main__][INFO] - agents played in iteration 419 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:40:37,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:37,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:40:37,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:40:38,231][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:40:38,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:40:38,852][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:40:39,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:40:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:40:39,829][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:40:40,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:40:40,479][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:40:40,805][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:40:41,131][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:40:41,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:40:41,785][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:40:42,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:40:42,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:40:42,763][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:40:43,087][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:40:43,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:40:43,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:40:44,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:40:44,393][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:40:44,721][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:40:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:40:45,376][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:40:45,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:40:46,027][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:40:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:40:46,677][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:40:47,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:40:47,325][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:40:47,649][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:47,974][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:48,299][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:48,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:49,360][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:50,093][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:50,095][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:50,096][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:51,041][__main__][INFO] - Iteration 420 took 23s (40.57% Gen, 55.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 10m 36s. Estimated total time: 19h 47m 38s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 56s.
[2025-11-13 10:40:51,043][__main__][INFO] - Starting iteration 420.
[2025-11-13 10:40:51,047][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:51,048][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:00,230][__main__][INFO] - Number of regex retries in iteration 420: 0
[2025-11-13 10:41:00,230][__main__][INFO] - agents played in iteration 420 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:41:00,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:00,783][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:00,784][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:01,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:01,824][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:02,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:02,476][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:02,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:03,126][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:03,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:04,104][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:04,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:05,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:05,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:06,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:06,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:07,048][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:07,373][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:07,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:08,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:08,349][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:09,006][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:09,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:09,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:10,304][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:10,955][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:11,604][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:41:11,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:12,656][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:41:13,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:41:13,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:41:13,404][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:41:15,252][__main__][INFO] - Iteration 421 took 24s (37.93% Gen, 54.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 51s. Estimated total time: 20h 10m 17s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 20s, 500 more iterations: 3h 21m 42s.
[2025-11-13 10:41:15,254][__main__][INFO] - Starting iteration 421.
[2025-11-13 10:41:15,257][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:41:15,257][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:24,748][__main__][INFO] - Number of regex retries in iteration 421: 0
[2025-11-13 10:41:24,748][__main__][INFO] - agents played in iteration 421 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:41:25,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:25,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:25,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:25,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:25,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:25,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:26,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:26,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:26,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:26,978][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:27,304][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:27,955][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:28,280][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:28,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:28,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:29,582][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:29,906][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:30,231][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:30,556][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:30,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:31,533][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:31,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:32,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:32,523][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:32,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:33,179][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:33,837][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:34,162][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:34,487][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:34,811][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:35,461][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:35,787][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:36,112][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:41:36,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:37,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:41:37,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:41:37,892][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:41:37,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:41:38,900][__main__][INFO] - Iteration 422 took 23s (40.14% Gen, 55.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 4m 23s. Estimated total time: 19h 42m 12s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s.
[2025-11-13 10:41:38,902][__main__][INFO] - Starting iteration 422.
[2025-11-13 10:41:38,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:41:38,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:48,312][__main__][INFO] - Number of regex retries in iteration 422: 0
[2025-11-13 10:41:48,313][__main__][INFO] - agents played in iteration 422 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:41:48,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:48,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:48,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:48,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:48,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:48,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:49,655][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:50,277][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:50,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:50,929][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:51,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:51,580][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:51,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:52,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:52,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:53,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:54,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:55,487][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:55,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:56,143][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:56,469][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:56,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:57,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:58,108][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:58,432][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:58,757][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:59,081][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:59,730][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:00,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:00,779][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:01,517][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:01,518][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:01,520][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:02,496][__main__][INFO] - Iteration 423 took 23s (39.87% Gen, 55.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 19s. Estimated total time: 19h 39m 32s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 35s.
[2025-11-13 10:42:02,498][__main__][INFO] - Starting iteration 423.
[2025-11-13 10:42:02,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:02,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:42:12,238][__main__][INFO] - Number of regex retries in iteration 423: 0 [2025-11-13 10:42:12,239][__main__][INFO] - agents played in iteration 423 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:42:12,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:12,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:12,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:12,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:42:12,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:42:12,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:42:13,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:13,837][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:14,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:14,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:14,814][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:15,141][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:16,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:16,765][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:17,090][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:17,420][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:17,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:18,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:18,406][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:18,734][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:19,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:19,712][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:20,039][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:20,688][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:21,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:21,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:21,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:21,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:22,322][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:22,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:22,969][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:23,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:23,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:24,662][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:25,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:25,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:25,389][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:26,364][__main__][INFO] - Iteration 424 took 23s (40.80% Gen, 55.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 32s. Estimated total time: 19h 53m 10s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 51s.
[2025-11-13 10:42:26,366][__main__][INFO] - Starting iteration 424.
[2025-11-13 10:42:26,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:26,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:36,011][__main__][INFO] - Number of regex retries in iteration 424: 0
[2025-11-13 10:42:36,012][__main__][INFO] - agents played in iteration 424 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:42:36,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:36,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:36,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:36,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:36,564][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:36,564][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:37,311][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:38,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:38,911][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:39,886][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:40,536][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:40,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:41,517][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:41,843][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:42,168][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:42,817][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:43,472][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:43,798][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:44,124][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:44,451][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:44,777][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:45,426][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:45,750][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:46,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:46,399][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:46,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:47,048][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:47,372][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:47,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:48,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:49,115][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:49,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:49,118][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:50,039][__main__][INFO] - Iteration 425 took 23s (40.73% Gen, 55.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 4m 32s. Estimated total time: 19h 43m 33s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 15s.
[2025-11-13 10:42:50,042][__main__][INFO] - Starting iteration 425.
[2025-11-13 10:42:50,045][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:50,045][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:59,768][__main__][INFO] - Number of regex retries in iteration 425: 0
[2025-11-13 10:42:59,769][__main__][INFO] - agents played in iteration 425 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:43:00,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:00,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:00,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:00,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:00,313][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:00,314][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:01,057][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:02,005][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:02,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:03,961][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:04,292][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:04,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:05,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:05,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:05,923][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:06,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:06,574][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:07,551][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:08,201][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:08,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:08,857][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:09,181][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:10,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:10,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:11,141][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:11,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:12,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:12,890][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:12,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:12,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:13,797][__main__][INFO] - Iteration 426 took 23s (40.94% Gen, 55.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 8m 15s. Estimated total time: 19h 47m 39s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 56s.
[2025-11-13 10:43:13,799][__main__][INFO] - Starting iteration 426.
[2025-11-13 10:43:13,802][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:13,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:22,928][__main__][INFO] - Number of regex retries in iteration 426: 0
[2025-11-13 10:43:22,929][__main__][INFO] - agents played in iteration 426 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:43:23,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:23,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:23,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:23,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:23,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:23,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:24,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:24,530][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:24,855][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:25,181][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:25,507][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:25,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:26,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:26,482][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:26,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:27,132][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:27,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:27,782][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:28,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:28,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:28,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:29,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:29,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:31,702][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:32,030][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:32,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:32,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:33,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:33,659][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:34,310][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:34,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:35,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:36,064][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:36,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:36,067][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:37,000][__main__][INFO] - Iteration 427 took 23s (39.34% Gen, 56.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 40m 10s. Estimated total time: 19h 19m 58s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 19s.
[2025-11-13 10:43:37,003][__main__][INFO] - Starting iteration 427.
[2025-11-13 10:43:37,006][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:37,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:45,973][__main__][INFO] - Number of regex retries in iteration 427: 0
[2025-11-13 10:43:45,974][__main__][INFO] - agents played in iteration 427 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:43:46,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:46,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:46,522][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:47,274][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:47,571][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:47,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:48,874][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:49,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:49,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:50,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:51,151][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:51,480][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:51,806][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:52,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:52,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:53,109][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:53,434][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:54,084][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:54,737][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:55,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:55,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:55,718][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:56,042][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:56,369][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:57,020][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:57,344][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:57,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:58,375][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:59,095][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:59,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:59,098][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:00,070][__main__][INFO] - Iteration 428 took 23s (38.88% Gen, 56.90% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 33m 4s. Estimated total time: 19h 13m 14s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 12s.
[2025-11-13 10:44:00,072][__main__][INFO] - Starting iteration 428.
[2025-11-13 10:44:00,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:44:00,076][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:09,307][__main__][INFO] - Number of regex retries in iteration 428: 0
[2025-11-13 10:44:09,307][__main__][INFO] - agents played in iteration 428 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:44:09,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:09,854][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:09,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:10,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:11,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:11,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:11,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:12,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:44:12,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:44:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:44:13,207][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:44:13,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:44:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:44:14,185][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:44:14,510][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:14,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:15,493][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:16,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:17,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:17,784][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:18,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:18,439][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:18,766][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:19,419][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:20,068][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:20,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:21,050][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:21,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:22,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:44:22,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:44:22,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:44:22,829][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:44:23,805][__main__][INFO] - Iteration 429 took 23s (38.90% Gen, 56.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 5m 58s. Estimated total time: 19h 46m 33s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 45s. [2025-11-13 10:44:23,808][__main__][INFO] - Starting iteration 429. [2025-11-13 10:44:23,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 10:44:23,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:33,373][__main__][INFO] - Number of regex retries in iteration 429: 0
[2025-11-13 10:44:33,374][__main__][INFO] - agents played in iteration 429 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:44:33,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:33,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:33,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:33,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:33,930][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:33,931][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:34,675][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:35,298][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:35,623][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:44:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:44:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:44:36,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:44:37,253][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:44:37,579][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:44:37,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:44:38,231][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:38,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:38,883][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:39,210][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:39,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:39,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:40,186][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:40,511][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:40,836][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:41,489][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:41,816][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:42,144][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:42,469][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:42,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:43,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:43,780][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:44,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:44,759][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:45,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:45,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:44:46,521][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:44:46,522][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:44:46,524][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:47,495][__main__][INFO] - Iteration 430 took 23s (40.38% Gen, 55.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 13s. Estimated total time: 19h 44m 11s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 21s.
[2025-11-13 10:44:47,497][__main__][INFO] - Starting iteration 430.
[2025-11-13 10:44:47,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:44:47,501][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:57,330][__main__][INFO] - Number of regex retries in iteration 430: 0
[2025-11-13 10:44:57,331][__main__][INFO] - agents played in iteration 430 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:44:57,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:57,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:57,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:58,940][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:59,266][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:59,591][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:59,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:00,241][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:00,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:01,220][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:01,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:01,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:02,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:03,836][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:04,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:04,488][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:04,814][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:05,140][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:05,470][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:05,800][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:06,126][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:06,454][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:06,778][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:07,754][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:08,408][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:08,732][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:09,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:09,784][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:10,506][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:10,508][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:10,509][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:45:12,385][__main__][INFO] - Iteration 431 took 24s (39.50% Gen, 52.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 55s. Estimated total time: 20h 44m 18s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 28s, 500 more iterations: 3h 27m 23s.
[2025-11-13 10:45:12,388][__main__][INFO] - Starting iteration 431.
[2025-11-13 10:45:12,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:45:12,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:45:22,161][__main__][INFO] - Number of regex retries in iteration 431: 0
[2025-11-13 10:45:22,162][__main__][INFO] - agents played in iteration 431 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:45:22,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:22,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:22,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:22,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:22,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:45:22,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:45:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:45:23,742][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:45:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:45:24,395][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:24,721][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:25,373][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:25,698][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:26,349][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:26,674][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:27,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:27,327][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:27,978][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:28,632][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:28,960][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:29,285][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:29,939][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:30,589][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:30,913][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:31,238][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:31,561][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:31,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:32,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:32,535][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:33,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:33,512][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:33,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:34,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:35,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:35,280][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:35,281][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:45:36,219][__main__][INFO] - Iteration 432 took 23s (41.01% Gen, 55.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 39s. Estimated total time: 19h 51m 26s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 34s.
[2025-11-13 10:45:36,221][__main__][INFO] - Starting iteration 432.
[2025-11-13 10:45:36,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:45:36,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:45:46,006][__main__][INFO] - Number of regex retries in iteration 432: 0
[2025-11-13 10:45:46,007][__main__][INFO] - agents played in iteration 432 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:45:46,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:46,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:46,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:46,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:46,571][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:45:46,571][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:45:47,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:45:47,604][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:45:47,931][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:45:48,257][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:48,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:49,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:50,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:51,202][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:51,528][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:51,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:52,181][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:53,163][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:54,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:54,476][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:54,805][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:55,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:55,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:55,781][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:56,105][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:56,432][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:56,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:57,413][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:57,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:58,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:59,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:59,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:59,183][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:00,101][__main__][INFO] - Iteration 433 took 23s (40.97% Gen, 55.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 43s. Estimated total time: 19h 53m 54s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 59s.
[2025-11-13 10:46:00,103][__main__][INFO] - Starting iteration 433.
[2025-11-13 10:46:00,106][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:00,107][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:09,128][__main__][INFO] - Number of regex retries in iteration 433: 0
[2025-11-13 10:46:09,129][__main__][INFO] - agents played in iteration 433 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:46:09,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:09,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:09,677][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:10,429][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:10,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:11,050][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:11,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:12,027][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:12,352][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:12,679][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:46:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:46:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:46:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:46:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:46:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:46:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:46:14,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:46:15,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:46:15,624][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:15,955][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:16,287][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:16,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:16,945][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:17,274][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:17,928][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:18,255][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:18,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:18,912][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:19,236][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:19,562][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:19,889][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:20,216][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:20,541][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:20,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:21,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:22,297][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:22,298][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:22,299][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:23,269][__main__][INFO] - Iteration 434 took 23s (38.95% Gen, 56.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 35s. Estimated total time: 19h 18m 9s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 1s.
[2025-11-13 10:46:23,271][__main__][INFO] - Starting iteration 434.
[2025-11-13 10:46:23,274][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:23,275][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:32,576][__main__][INFO] - Number of regex retries in iteration 434: 0
[2025-11-13 10:46:32,576][__main__][INFO] - agents played in iteration 434 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:46:33,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:33,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:33,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:34,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:34,841][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:35,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:35,818][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:36,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:46:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:46:36,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:46:37,122][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:46:37,450][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:46:37,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:46:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:46:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:46:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:46:39,083][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:39,407][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:41,700][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:42,355][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:42,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:43,005][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:43,333][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:43,658][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:44,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:45,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:45,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:45,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:45,742][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:46,731][__main__][INFO] - Iteration 435 took 23s (39.65% Gen, 56.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 49m 53s. Estimated total time: 19h 32m 51s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 28s.
[2025-11-13 10:46:46,733][__main__][INFO] - Starting iteration 435.
[2025-11-13 10:46:46,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:46,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:56,439][__main__][INFO] - Number of regex retries in iteration 435: 0
[2025-11-13 10:46:56,440][__main__][INFO] - agents played in iteration 435 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:46:56,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:56,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:56,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:56,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:56,996][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:56,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:58,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:58,359][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:59,012][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:59,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:59,663][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:59,992][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:00,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:00,971][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:01,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:01,948][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:02,273][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:02,598][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:02,924][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:03,576][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:04,881][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:05,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:05,528][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:47:06,177][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:47:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:47:06,830][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:47:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:47:07,480][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:47:07,805][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:47:08,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:47:08,852][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:47:09,577][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:47:09,579][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:47:09,580][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:47:10,495][__main__][INFO] - Iteration 436 took 23s (40.84% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 4m 39s. Estimated total time: 19h 48m 1s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 36s, 500 more iterations: 3h 18m 0s.
[2025-11-13 10:47:10,498][__main__][INFO] - Starting iteration 436.
[2025-11-13 10:47:10,501][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:47:10,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:19,069][__main__][INFO] - Number of regex retries in iteration 436: 0
[2025-11-13 10:47:19,070][__main__][INFO] - agents played in iteration 436 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:47:19,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:19,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:19,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:19,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:19,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:19,622][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:20,371][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:20,669][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:20,994][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:21,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:21,644][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:22,299][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:22,624][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:22,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:23,275][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:23,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:23,930][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:24,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:24,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:24,913][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:25,238][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:25,566][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:25,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:26,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:26,545][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:26,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:27,199][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:27,524][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:27,852][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:28,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:28,507][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:47:28,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:47:29,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:47:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:47:29,810][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:47:30,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:47:30,462][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:47:30,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:47:31,501][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:47:32,224][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:47:32,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:47:32,227][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:47:33,200][__main__][INFO] - Iteration 437 took 22s (37.75% Gen, 57.96% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 11m 15s. Estimated total time: 18h 54m 59s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 9s.
[2025-11-13 10:47:33,202][__main__][INFO] - Starting iteration 437.
[2025-11-13 10:47:33,205][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:47:33,206][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:42,631][__main__][INFO] - Number of regex retries in iteration 437: 0
[2025-11-13 10:47:42,632][__main__][INFO] - agents played in iteration 437 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:47:43,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:43,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:43,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:43,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:43,188][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:43,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:44,238][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:44,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:44,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:45,215][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:45,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:45,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:46,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:46,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:47,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:47,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:48,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:49,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:49,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:50,446][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:50,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:51,103][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:51,432][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:52,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:47:52,412][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:47:52,740][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:47:53,070][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:47:53,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:47:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:47:54,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:47:54,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:47:55,084][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:47:55,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:47:55,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:47:55,822][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:47:56,815][__main__][INFO] - Iteration 438 took 23s (39.92% Gen, 55.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 24s. Estimated total time: 19h 40m 32s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 45s.
[2025-11-13 10:47:56,817][__main__][INFO] - Starting iteration 438.
[2025-11-13 10:47:56,820][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:47:56,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:48:06,131][__main__][INFO] - Number of regex retries in iteration 438: 0
[2025-11-13 10:48:06,132][__main__][INFO] - agents played in iteration 438 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:48:06,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:06,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:06,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:06,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:06,680][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:48:06,681][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:48:07,434][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:07,731][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:08,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:08,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:08,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:09,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:09,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:10,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:11,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:11,326][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:11,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:11,977][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:12,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:12,627][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:12,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:48:13,282][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:48:13,608][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:48:13,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:48:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:48:14,585][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:48:14,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:48:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:48:15,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:15,895][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:16,222][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:16,547][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:17,198][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:17,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:48:18,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:19,304][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:19,306][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:19,307][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:20,258][__main__][INFO] - Iteration 439 took 23s (39.72% Gen, 56.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 47m 25s. Estimated total time: 19h 31m 56s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 19s.
[2025-11-13 10:48:20,260][__main__][INFO] - Starting iteration 439.
[2025-11-13 10:48:20,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:48:20,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:48:29,917][__main__][INFO] - Number of regex retries in iteration 439: 0
[2025-11-13 10:48:29,917][__main__][INFO] - agents played in iteration 439 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:48:30,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:30,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:30,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:30,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:30,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:48:30,470][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:48:31,216][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:31,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:32,165][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:32,817][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:33,143][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:33,470][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:34,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:35,428][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:35,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:36,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:36,738][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:48:37,064][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:48:37,392][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:48:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:48:38,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:48:38,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:48:38,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:48:39,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:48:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:39,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:40,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:40,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:40,984][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:41,311][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:41,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:48:42,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:43,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:43,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:43,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:44,071][__main__][INFO] - Iteration 440 took 23s (40.54% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 5m 28s. Estimated total time: 19h 50m 22s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 23s.
[2025-11-13 10:48:44,073][__main__][INFO] - Starting iteration 440.
[2025-11-13 10:48:44,076][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:48:44,077][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:48:52,873][__main__][INFO] - Number of regex retries in iteration 440: 0
[2025-11-13 10:48:52,873][__main__][INFO] - agents played in iteration 440 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:48:53,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:53,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:53,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:53,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:53,422][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:48:53,422][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:48:54,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:54,465][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:54,791][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:55,115][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:55,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:55,767][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:56,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:57,072][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:57,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:58,382][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:58,707][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:59,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:59,683][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:00,010][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:00,336][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:00,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:00,996][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:01,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:01,972][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:02,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:02,623][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:02,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:03,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:03,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:03,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:04,251][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:04,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:05,297][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:06,014][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:06,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:06,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:07,873][__main__][INFO] - Iteration 441 took 23s (36.97% Gen, 55.23% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 4m 35s. Estimated total time: 19h 49m 54s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 19s.
[2025-11-13 10:49:07,875][__main__][INFO] - Starting iteration 441.
[2025-11-13 10:49:07,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:49:07,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:17,160][__main__][INFO] - Number of regex retries in iteration 441: 0
[2025-11-13 10:49:17,161][__main__][INFO] - agents played in iteration 441 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:49:17,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:17,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:17,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:17,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:17,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:17,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:18,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:18,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:19,145][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:19,797][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:20,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:20,773][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:21,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:21,425][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:21,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:22,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:22,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:23,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:24,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:24,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:25,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:25,348][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:26,002][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:26,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:26,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:26,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:27,308][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:27,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:28,285][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:28,610][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:28,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:29,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:30,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:30,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:30,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:31,345][__main__][INFO] - Iteration 442 took 23s (39.55% Gen, 56.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 47m 41s. Estimated total time: 19h 33m 23s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s.
[2025-11-13 10:49:31,347][__main__][INFO] - Starting iteration 442.
[2025-11-13 10:49:31,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:49:31,350][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:40,729][__main__][INFO] - Number of regex retries in iteration 442: 0
[2025-11-13 10:49:40,729][__main__][INFO] - agents played in iteration 442 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:49:41,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:41,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:41,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:41,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:41,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:41,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:42,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:42,319][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:42,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:42,970][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:43,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:44,271][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:44,597][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:44,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:45,573][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:45,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:46,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:46,551][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:46,875][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:47,201][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:47,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:48,176][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:48,502][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:48,828][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:49,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:49,479][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:49,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:50,461][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:50,789][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:51,113][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:51,441][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:52,091][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:52,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:53,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:53,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:53,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:53,862][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:54,782][__main__][INFO] - Iteration 443 took 23s (40.03% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 45m 34s. Estimated total time: 19h 31m 40s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 16s.
[2025-11-13 10:49:54,784][__main__][INFO] - Starting iteration 443.
[2025-11-13 10:49:54,787][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:49:54,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:03,899][__main__][INFO] - Number of regex retries in iteration 443: 0
[2025-11-13 10:50:03,900][__main__][INFO] - agents played in iteration 443 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:50:04,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:04,453][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:04,453][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:05,200][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:05,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:05,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:06,147][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:06,473][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:06,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:07,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:07,777][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:08,430][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:08,755][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:09,080][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:09,405][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:09,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:10,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:10,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:11,039][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:11,365][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:11,692][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:12,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:13,000][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:13,651][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:13,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:14,301][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:14,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:14,958][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:15,285][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:15,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:16,348][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:17,080][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:17,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:17,083][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:18,100][__main__][INFO] - Iteration 444 took 23s (39.08% Gen, 56.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 39m 13s. Estimated total time: 19h 25m 42s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 17s.
[2025-11-13 10:50:18,102][__main__][INFO] - Starting iteration 444.
[2025-11-13 10:50:18,105][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:18,105][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:27,927][__main__][INFO] - Number of regex retries in iteration 444: 0
[2025-11-13 10:50:27,927][__main__][INFO] - agents played in iteration 444 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:50:28,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:28,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:28,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:29,223][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:29,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:29,845][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:30,496][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:31,473][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:33,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:33,423][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:33,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:34,073][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:34,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:35,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:35,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:35,694][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:36,020][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:36,671][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:37,324][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:37,652][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:37,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:38,302][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:38,625][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:39,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:39,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:40,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:41,035][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:41,037][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:41,039][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:41,993][__main__][INFO] - Iteration 445 took 23s (41.12% Gen, 54.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 7m 34s. Estimated total time: 19h 54m 27s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 4s.
[2025-11-13 10:50:41,995][__main__][INFO] - Starting iteration 445.
[2025-11-13 10:50:41,998][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:41,999][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:51,134][__main__][INFO] - Number of regex retries in iteration 445: 0
[2025-11-13 10:50:51,135][__main__][INFO] - agents played in iteration 445 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:50:51,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:51,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:51,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:52,731][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:53,057][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:53,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:54,035][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:54,361][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:54,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:55,011][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:55,667][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:56,318][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:56,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:56,969][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:57,620][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:57,944][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:58,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:58,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:59,254][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:59,584][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:00,238][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:00,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:01,218][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:02,200][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:02,526][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:02,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:03,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:04,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:04,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:04,301][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:05,431][__main__][INFO] - Iteration 446 took 23s (38.98% Gen, 56.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 44m 23s. Estimated total time: 19h 31m 40s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 16s.
[2025-11-13 10:51:05,433][__main__][INFO] - Starting iteration 446.
[2025-11-13 10:51:05,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:05,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:51:15,326][__main__][INFO] - Number of regex retries in iteration 446: 0
[2025-11-13 10:51:15,327][__main__][INFO] - agents played in iteration 446 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:51:15,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:15,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:15,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:15,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:15,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:51:15,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:51:16,613][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:17,235][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:17,885][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:18,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:18,861][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:19,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:19,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:19,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:20,488][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:21,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:22,112][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:22,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:22,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:23,085][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:23,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:24,060][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:24,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:24,712][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:25,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:25,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:26,012][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:26,337][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:26,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:27,697][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:28,423][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:28,425][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:28,426][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:29,327][__main__][INFO] - Iteration 447 took 23s (41.40% Gen, 54.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 55s. Estimated total time: 19h 54m 35s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 5s.
[2025-11-13 10:51:29,328][__main__][INFO] - Starting iteration 447.
[2025-11-13 10:51:29,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:29,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:51:38,585][__main__][INFO] - Number of regex retries in iteration 447: 0
[2025-11-13 10:51:38,586][__main__][INFO] - agents played in iteration 447 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:51:39,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:39,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:39,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:39,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:39,141][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:51:39,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:51:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:40,181][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:40,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:40,832][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:41,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:41,808][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:42,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:42,458][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:42,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:43,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:43,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:44,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:44,744][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:45,072][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:45,729][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:46,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:48,034][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:48,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:48,686][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:49,009][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:49,335][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:49,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:50,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:51,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:51,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:51,734][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:51,736][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:52,659][__main__][INFO] - Iteration 448 took 23s (39.67% Gen, 56.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 38m 23s. Estimated total time: 19h 26m 27s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 24s.
[2025-11-13 10:51:52,662][__main__][INFO] - Starting iteration 448.
[2025-11-13 10:51:52,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:52,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:02,226][__main__][INFO] - Number of regex retries in iteration 448: 0
[2025-11-13 10:52:02,227][__main__][INFO] - agents played in iteration 448 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:52:02,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:02,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:02,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:02,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:02,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:02,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:03,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:03,805][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:04,131][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:04,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:04,780][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:05,430][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:06,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:07,387][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:07,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:08,037][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:08,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:08,687][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:09,010][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:09,335][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:09,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:09,984][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:11,289][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:12,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:52:13,240][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:52:13,567][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:52:13,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:52:14,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:52:15,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:52:15,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:52:15,347][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:52:16,232][__main__][INFO] - Iteration 449 took 23s (40.57% Gen, 55.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 49m 57s. Estimated total time: 19h 38m 24s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 24s.
[2025-11-13 10:52:16,235][__main__][INFO] - Starting iteration 449.
[2025-11-13 10:52:16,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:52:16,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:25,148][__main__][INFO] - Number of regex retries in iteration 449: 0
[2025-11-13 10:52:25,148][__main__][INFO] - agents played in iteration 449 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:52:25,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:25,697][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:25,697][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:26,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:27,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:27,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:27,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:28,373][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:29,024][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:29,675][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:30,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:30,325][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:30,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:30,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:31,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:32,276][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:32,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:33,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:33,575][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:33,905][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:34,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:34,883][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:35,532][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:52:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:52:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:52:36,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:52:37,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:52:38,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:52:38,286][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:52:38,288][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:52:39,261][__main__][INFO] - Iteration 450 took 23s (38.70% Gen, 57.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 22m 22s. Estimated total time: 19h 11m 12s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 52s.
[2025-11-13 10:52:39,263][__main__][INFO] - Starting iteration 450.
[2025-11-13 10:52:39,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:52:39,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:49,102][__main__][INFO] - Number of regex retries in iteration 450: 0
[2025-11-13 10:52:49,102][__main__][INFO] - agents played in iteration 450 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:52:49,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:49,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:49,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:49,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:49,654][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:49,655][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:50,392][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:50,689][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:51,014][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:51,340][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:51,990][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:52,316][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:52,641][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:53,293][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:53,620][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:53,947][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:54,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:54,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:55,576][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:55,901][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:56,227][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:56,553][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:56,877][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:57,202][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:57,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:57,854][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:58,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:58,505][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:58,831][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:59,809][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:53:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:53:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:53:00,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:53:01,506][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:02,226][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:02,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:02,229][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:04,059][__main__][INFO] - Iteration 451 took 24s (39.67% Gen, 52.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 50m 26s. Estimated total time: 20h 39m 41s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 19s, 500 more iterations: 3h 26m 36s. [2025-11-13 10:53:04,061][__main__][INFO] - Starting iteration 451. [2025-11-13 10:53:04,064][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:53:04,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:12,801][__main__][INFO] - Number of regex retries in iteration 451: 0 [2025-11-13 10:53:12,801][__main__][INFO] - agents played in iteration 451 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:53:13,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:13,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:13,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:13,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:13,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:13,356][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:53:14,120][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:14,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:14,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:15,068][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:15,392][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:15,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:16,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:17,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:17,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:18,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:18,647][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:18,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:19,299][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:19,624][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:20,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:20,598][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:53:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:21,254][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:21,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:21,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:22,238][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:22,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:22,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:23,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:23,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:24,193][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:24,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:53:25,236][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:25,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:25,953][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:25,955][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:26,944][__main__][INFO] - Iteration 452 took 22s (38.19% Gen, 57.48% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 14m 25s. Estimated total time: 19h 4m 3s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 40s. [2025-11-13 10:53:26,946][__main__][INFO] - Starting iteration 452. [2025-11-13 10:53:26,950][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:53:26,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:36,205][__main__][INFO] - Number of regex retries in iteration 452: 0 [2025-11-13 10:53:36,206][__main__][INFO] - agents played in iteration 452 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:53:36,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:36,761][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:36,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:53:37,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:37,814][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:38,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:38,465][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:38,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:39,116][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:39,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:39,767][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:40,093][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:40,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:40,743][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:41,069][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:41,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:41,723][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:42,049][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:42,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:43,026][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:43,677][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:44,003][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:53:44,327][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:44,651][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:44,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:45,301][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:45,626][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:45,952][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:46,605][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:46,931][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:47,258][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:47,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:53:48,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:49,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:49,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:49,354][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:50,323][__main__][INFO] - Iteration 453 took 23s (39.60% Gen, 56.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 38m 41s. Estimated total time: 19h 28m 42s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 47s. [2025-11-13 10:53:50,325][__main__][INFO] - Starting iteration 453. [2025-11-13 10:53:50,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:53:50,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:59,928][__main__][INFO] - Number of regex retries in iteration 453: 0 [2025-11-13 10:53:59,929][__main__][INFO] - agents played in iteration 453 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:54:00,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:00,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:00,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:00,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:00,487][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:00,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:01,521][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:01,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:02,173][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:02,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:03,149][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:03,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:04,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:04,450][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:04,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:05,102][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:05,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:07,397][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:07,723][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:08,054][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:08,378][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:08,706][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:09,034][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:09,692][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:10,021][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:10,347][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:10,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:11,004][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:11,334][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:11,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:12,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:13,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:13,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:13,119][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:14,087][__main__][INFO] - Iteration 454 took 23s (40.40% Gen, 55.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 57m 32s. Estimated total time: 19h 47m 56s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 59s. [2025-11-13 10:54:14,089][__main__][INFO] - Starting iteration 454. [2025-11-13 10:54:14,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:54:14,092][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:23,797][__main__][INFO] - Number of regex retries in iteration 454: 0 [2025-11-13 10:54:23,797][__main__][INFO] - agents played in iteration 454 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:54:24,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:24,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:24,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:24,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:24,351][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:24,351][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:25,089][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:25,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:25,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:26,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:27,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:27,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:27,991][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:28,645][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:28,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:29,297][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:29,952][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:30,283][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:30,613][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:31,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:31,593][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:32,574][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:32,902][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:33,233][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:34,220][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:34,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:34,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:35,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:36,248][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:36,974][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:36,976][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:36,977][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:37,884][__main__][INFO] - Iteration 455 took 23s (40.79% Gen, 55.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 58m 50s. Estimated total time: 19h 49m 38s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 16s. [2025-11-13 10:54:37,886][__main__][INFO] - Starting iteration 455. [2025-11-13 10:54:37,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:54:37,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:47,339][__main__][INFO] - Number of regex retries in iteration 455: 0 [2025-11-13 10:54:47,340][__main__][INFO] - agents played in iteration 455 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 10:54:47,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:47,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:47,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:48,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:48,942][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:49,268][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:49,594][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:49,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:50,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:50,895][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:51,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:51,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:51,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:52,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:52,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:52,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:53,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:53,835][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:54,159][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:54,489][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:54,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:55,143][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:55,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:55,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:56,459][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:56,784][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:57,111][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:57,763][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:58,091][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:58,741][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:59,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:59,780][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:55:00,498][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:55:00,499][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:55:00,501][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:55:01,517][__main__][INFO] - Iteration 456 took 23s (39.99% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 50m 13s. Estimated total time: 19h 41m 26s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 54s. [2025-11-13 10:55:01,518][__main__][INFO] - Starting iteration 456. [2025-11-13 10:55:01,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:55:01,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:11,350][__main__][INFO] - Number of regex retries in iteration 456: 0
[2025-11-13 10:55:11,350][__main__][INFO] - agents played in iteration 456 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:55:11,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:11,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:11,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:11,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:11,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:11,899][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:13,260][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:13,585][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:13,910][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:14,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:55:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:55:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:55:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:55:15,540][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:55:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:55:16,191][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:55:16,517][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:55:16,842][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:55:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:55:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:55:17,820][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:55:18,145][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:55:18,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:55:18,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:55:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:55:19,459][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:20,768][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:21,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:21,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:22,396][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:22,721][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:23,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:23,765][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:24,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:24,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:24,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:25,404][__main__][INFO] - Iteration 457 took 23s (41.15% Gen, 55.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 34s. Estimated total time: 19h 54m 10s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 1s.
[2025-11-13 10:55:25,406][__main__][INFO] - Starting iteration 457.
[2025-11-13 10:55:25,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:25,410][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:33,994][__main__][INFO] - Number of regex retries in iteration 457: 0
[2025-11-13 10:55:33,995][__main__][INFO] - agents played in iteration 457 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:55:34,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:34,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:34,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:35,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:35,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:35,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:36,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:36,886][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:55:37,211][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:55:37,537][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:55:37,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:55:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:55:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:55:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:55:39,166][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:55:39,490][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:55:39,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:55:40,138][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:55:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:55:40,789][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:55:41,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:55:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:55:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:55:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:42,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:42,749][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:43,076][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:43,401][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:43,726][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:44,051][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:44,375][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:44,699][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:45,024][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:45,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:46,405][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:47,151][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:47,153][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:47,154][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:48,111][__main__][INFO] - Iteration 458 took 22s (37.81% Gen, 57.97% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 3m 8s. Estimated total time: 18h 55m 7s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 11s.
[2025-11-13 10:55:48,113][__main__][INFO] - Starting iteration 458.
[2025-11-13 10:55:48,116][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:48,116][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:57,008][__main__][INFO] - Number of regex retries in iteration 458: 0
[2025-11-13 10:55:57,008][__main__][INFO] - agents played in iteration 458 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:55:57,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:57,556][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:57,557][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:58,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:58,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:59,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:59,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:00,561][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:00,887][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:01,862][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:02,187][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:02,513][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:02,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:03,487][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:03,811][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:04,138][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:04,468][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:04,798][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:05,125][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:05,453][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:06,111][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:06,438][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:06,763][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:07,750][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:08,079][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:08,405][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:08,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:09,449][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:10,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:10,185][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:10,186][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:11,193][__main__][INFO] - Iteration 459 took 23s (38.53% Gen, 57.10% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 21m 32s. Estimated total time: 19h 13m 54s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 19s.
[2025-11-13 10:56:11,196][__main__][INFO] - Starting iteration 459.
[2025-11-13 10:56:11,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:11,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:20,887][__main__][INFO] - Number of regex retries in iteration 459: 0
[2025-11-13 10:56:20,887][__main__][INFO] - agents played in iteration 459 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:56:21,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:21,438][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:21,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:22,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:22,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:23,134][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:23,459][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:23,786][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:24,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:24,438][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:25,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:25,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:26,066][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:26,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:26,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:27,054][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:27,381][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:27,706][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:28,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:28,360][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:28,684][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:29,340][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:30,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:30,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:31,628][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:32,277][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:32,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:33,330][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:34,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:34,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:34,134][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:35,077][__main__][INFO] - Iteration 460 took 23s (40.57% Gen, 55.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 9s. Estimated total time: 19h 53m 55s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 59s.
[2025-11-13 10:56:35,080][__main__][INFO] - Starting iteration 460.
[2025-11-13 10:56:35,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:35,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:44,145][__main__][INFO] - Number of regex retries in iteration 460: 0
[2025-11-13 10:56:44,145][__main__][INFO] - agents played in iteration 460 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:56:44,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:44,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:44,699][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:45,744][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:46,071][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:46,396][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:46,721][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:47,372][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:47,697][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:48,673][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:48,999][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:49,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:49,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:49,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:50,300][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:50,625][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:50,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:51,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:52,251][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:53,562][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:54,225][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:54,551][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:55,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:55,526][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:55,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:56,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:57,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:57,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:57,302][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:59,352][__main__][INFO] - Iteration 461 took 24s (37.34% Gen, 54.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 18s. Estimated total time: 20h 13m 28s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 26s, 500 more iterations: 3h 22m 14s.
[2025-11-13 10:56:59,354][__main__][INFO] - Starting iteration 461.
[2025-11-13 10:56:59,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:56:59,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:08,505][__main__][INFO] - Number of regex retries in iteration 461: 0
[2025-11-13 10:57:08,506][__main__][INFO] - agents played in iteration 461 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:57:08,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:08,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:09,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:09,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:09,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:09,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:09,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:10,749][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:11,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:11,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:11,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:12,052][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:12,377][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:57:13,028][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:57:13,354][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:57:13,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:57:14,013][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:57:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:57:14,665][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:57:14,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:15,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:15,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:15,974][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:17,286][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:17,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:18,586][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:18,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:19,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:19,890][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:20,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:20,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:21,650][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:21,651][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:21,653][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:22,622][__main__][INFO] - Iteration 462 took 23s (39.32% Gen, 56.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 29m 46s. Estimated total time: 19h 23m 19s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 53s.
[2025-11-13 10:57:22,624][__main__][INFO] - Starting iteration 462.
[2025-11-13 10:57:22,628][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:57:22,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:32,330][__main__][INFO] - Number of regex retries in iteration 462: 0
[2025-11-13 10:57:32,330][__main__][INFO] - agents played in iteration 462 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:57:32,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:32,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:32,885][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:33,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:33,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:34,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:34,902][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:35,227][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:35,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:36,203][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:57:36,853][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:57:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:57:37,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:57:37,831][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:57:38,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:57:38,485][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:57:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:39,461][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:39,788][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:40,440][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:40,765][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:41,091][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:41,743][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:42,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:42,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:43,041][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:43,365][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:44,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:44,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:45,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:45,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:45,493][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:46,627][__main__][INFO] - Iteration 463 took 23s (40.42% Gen, 54.85% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 1s. Estimated total time: 19h 59m 59s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 59s.
[2025-11-13 10:57:46,629][__main__][INFO] - Starting iteration 463.
[2025-11-13 10:57:46,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:57:46,632][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:55,558][__main__][INFO] - Number of regex retries in iteration 463: 0
[2025-11-13 10:57:55,558][__main__][INFO] - agents played in iteration 463 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:57:56,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:56,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:56,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:56,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:56,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:56,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:56,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:57,160][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:57,486][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:57,809][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:58,136][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:58,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:59,111][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:59,437][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:59,763][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:58:00,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:58:00,413][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:58:00,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:58:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:58:01,393][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:58:01,718][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:58:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:58:02,372][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:58:02,700][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:58:03,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:58:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:58:03,680][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:58:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:58:04,334][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:58:04,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:58:04,987][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:58:05,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:58:05,642][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:58:05,972][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:58:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:58:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:58:06,950][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:58:07,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:58:07,971][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:58:08,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:58:08,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:58:08,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:58:09,732][__main__][INFO] - Iteration 464 took 23s (38.64% Gen, 56.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 20m 43s. Estimated total time: 19h 15m 3s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 30s.
[2025-11-13 10:58:09,734][__main__][INFO] - Starting iteration 464.
[2025-11-13 10:58:09,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:58:09,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:58:19,407][__main__][INFO] - Number of regex retries in iteration 464: 0
[2025-11-13 10:58:19,407][__main__][INFO] - agents played in iteration 464 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:58:19,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:19,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:19,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:19,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:19,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:58:19,967][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:58:20,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:58:21,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:58:21,341][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:58:21,668][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:58:21,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:58:22,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:58:22,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:58:22,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:58:23,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:58:23,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:58:23,943][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:58:24,268][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:58:24,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:58:24,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:58:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:58:25,574][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:58:25,901][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:58:26,227][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:58:26,553][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:58:26,877][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:58:27,203][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:58:27,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:58:27,854][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:58:28,182][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:58:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:58:28,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:58:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:58:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:58:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:58:30,126][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:58:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:58:30,775][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:58:31,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:58:31,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:58:32,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:58:32,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:58:32,562][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:58:33,572][__main__][INFO] - Iteration 465 took 23s (40.57% Gen, 55.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 57m 1s. Estimated total time: 19h 51m 45s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 37s.
[2025-11-13 10:58:33,574][__main__][INFO] - Starting iteration 465.
[2025-11-13 10:58:33,577][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:58:33,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:58:42,751][__main__][INFO] - Number of regex retries in iteration 465: 0
[2025-11-13 10:58:42,752][__main__][INFO] - agents played in iteration 465 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:58:43,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:43,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:43,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:43,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:43,303][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:58:43,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:58:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:58:44,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:58:44,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:58:44,995][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:58:45,321][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:58:45,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:58:45,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:58:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:58:46,624][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:58:46,950][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:58:47,275][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:58:47,600][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:58:47,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:58:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:58:48,578][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:58:48,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:58:49,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:58:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:58:49,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:58:50,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:58:50,532][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:58:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:58:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:58:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:58:51,838][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:58:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:58:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:58:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:58:53,137][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:58:53,461][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:58:53,785][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:58:54,111][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:58:54,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:58:55,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:58:55,884][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:58:55,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:58:55,887][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:58:56,839][__main__][INFO] - Iteration 466 took 23s (39.44% Gen, 56.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 27m 59s. Estimated total time: 19h 23m 7s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 51s.
[2025-11-13 10:58:56,841][__main__][INFO] - Starting iteration 466.
[2025-11-13 10:58:56,844][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:58:56,845][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:59:05,588][__main__][INFO] - Number of regex retries in iteration 466: 0
[2025-11-13 10:59:05,588][__main__][INFO] - agents played in iteration 466 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:59:06,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:06,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:06,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:06,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:06,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:59:06,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:59:06,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:59:07,181][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:59:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:59:07,832][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:59:08,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:59:08,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:59:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:59:09,132][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:59:09,458][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:59:09,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:59:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:59:10,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:59:10,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:59:11,090][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:59:11,415][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:59:11,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:59:12,064][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:59:12,389][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:59:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:59:13,042][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:59:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:59:13,697][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:59:14,026][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:59:14,353][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:59:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:59:15,013][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:59:15,342][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:59:15,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:59:15,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:59:16,329][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:59:16,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:59:16,985][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:59:17,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:59:18,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:59:18,731][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:59:18,733][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:59:18,734][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:59:19,983][__main__][INFO] - Iteration 467 took 23s (37.78% Gen, 56.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 21m 27s. Estimated total time: 19h 16m 58s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 49s.
[2025-11-13 10:59:19,985][__main__][INFO] - Starting iteration 467.
[2025-11-13 10:59:19,988][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:59:19,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:59:28,787][__main__][INFO] - Number of regex retries in iteration 467: 0
[2025-11-13 10:59:28,787][__main__][INFO] - agents played in iteration 467 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:59:29,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:29,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:29,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:29,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:29,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:59:29,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:59:30,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:59:30,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:59:30,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:59:31,095][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:59:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:59:31,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:59:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:59:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:59:32,722][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:59:33,047][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:59:33,377][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:59:33,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:59:34,031][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:59:34,358][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:59:34,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:59:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:59:35,338][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:59:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:59:35,990][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:59:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:59:36,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:59:36,977][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:59:37,303][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:59:37,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:59:37,954][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:59:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:59:38,605][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:59:38,931][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:59:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:59:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:59:39,908][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:59:40,232][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:59:40,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:59:41,245][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:59:41,956][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:59:41,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:59:41,959][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:59:42,920][__main__][INFO] - Iteration 468 took 22s (38.37% Gen, 57.43% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 10m 45s. Estimated total time: 19h 6m 39s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 6s.
[2025-11-13 10:59:42,923][__main__][INFO] - Starting iteration 468.
[2025-11-13 10:59:42,926][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:59:42,927][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:59:52,095][__main__][INFO] - Number of regex retries in iteration 468: 0
[2025-11-13 10:59:52,096][__main__][INFO] - agents played in iteration 468 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 10:59:52,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:52,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:52,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:52,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:52,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:59:52,640][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:59:53,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:59:53,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:59:54,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:59:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:59:54,672][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:59:54,998][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:59:55,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:59:55,648][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:59:55,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:59:56,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:59:56,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:59:56,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:59:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:59:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:59:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:59:58,254][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:59:58,581][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:59:58,908][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:59:59,231][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:59:59,556][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:59:59,883][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:00,210][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:00,861][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:01,512][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:01,837][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:02,166][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:02,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:02,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:03,138][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:03,464][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:03,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:04,465][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:05,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:05,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:05,188][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:06,237][__main__][INFO] - Iteration 469 took 23s (39.33% Gen, 56.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 29m 17s. Estimated total time: 19h 25m 34s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 15s.
[2025-11-13 11:00:06,239][__main__][INFO] - Starting iteration 469.
[2025-11-13 11:00:06,243][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 11:00:06,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:00:15,570][__main__][INFO] - Number of regex retries in iteration 469: 0
[2025-11-13 11:00:15,571][__main__][INFO] - agents played in iteration 469 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:00:16,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:16,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:16,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:16,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:16,112][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:00:16,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:00:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:17,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:18,783][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:19,108][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:20,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:20,410][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:20,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:21,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:21,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:21,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:22,043][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:22,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:23,025][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:23,353][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:23,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:24,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:25,313][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:25,638][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:25,962][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:26,941][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:27,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:27,969][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:28,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:28,699][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:28,701][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:29,668][__main__][INFO] - Iteration 470 took 23s (39.82% Gen, 56.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 34m 39s. Estimated total time: 19h 31m 19s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 13s.
[2025-11-13 11:00:29,670][__main__][INFO] - Starting iteration 470.
[2025-11-13 11:00:29,673][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 11:00:29,674][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:00:38,163][__main__][INFO] - Number of regex retries in iteration 470: 0
[2025-11-13 11:00:38,164][__main__][INFO] - agents played in iteration 470 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:00:38,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:39,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:39,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:39,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:39,070][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:00:39,070][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:00:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:40,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:40,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:40,763][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:41,088][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:41,413][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:42,064][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:42,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:42,714][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:43,364][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:43,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:44,016][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:44,342][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:44,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:45,653][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:45,980][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:46,306][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:46,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:47,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:48,270][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:48,597][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:49,246][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:49,904][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:50,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:50,928][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:51,657][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:51,659][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:51,660][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:53,646][__main__][INFO] - Iteration 471 took 23s (35.41% Gen, 56.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 1m 35s. Estimated total time: 19h 58m 40s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 46s.
[2025-11-13 11:00:53,648][__main__][INFO] - Starting iteration 471.
[2025-11-13 11:00:53,652][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:00:53,653][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:03,070][__main__][INFO] - Number of regex retries in iteration 471: 0
[2025-11-13 11:01:03,070][__main__][INFO] - agents played in iteration 471 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:01:03,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:03,626][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:03,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:04,377][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:04,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:05,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:05,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:06,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:06,629][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:06,954][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:07,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:07,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:07,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:08,582][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:08,912][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:09,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:10,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:10,543][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:11,193][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:11,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:11,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:12,171][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:12,496][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:12,821][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:13,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:13,475][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:13,799][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:14,454][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:14,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:15,503][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:01:16,238][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:01:16,240][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:01:16,241][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:01:17,254][__main__][INFO] - Iteration 472 took 23s (39.90% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 42m 41s. Estimated total time: 19h 40m 9s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 41s.
[2025-11-13 11:01:17,257][__main__][INFO] - Starting iteration 472.
[2025-11-13 11:01:17,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:01:17,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:25,707][__main__][INFO] - Number of regex retries in iteration 472: 0
[2025-11-13 11:01:25,707][__main__][INFO] - agents played in iteration 472 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:01:26,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:26,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:26,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:26,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:26,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:26,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:27,356][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:27,980][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:28,305][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:28,958][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:29,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:29,610][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:29,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:30,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:30,585][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:30,910][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:31,238][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:31,892][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:32,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:32,547][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:32,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:33,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:33,527][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:33,852][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:34,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:34,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:34,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:35,160][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:35,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:36,476][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:36,802][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:37,452][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:37,777][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:38,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:01:39,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:01:39,252][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:01:39,254][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:01:40,229][__main__][INFO] - Iteration 473 took 22s (36.77% Gen, 58.98% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 10m 39s. Estimated total time: 19h 8m 30s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 25s.
[2025-11-13 11:01:40,231][__main__][INFO] - Starting iteration 473.
[2025-11-13 11:01:40,235][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:01:40,235][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:49,329][__main__][INFO] - Number of regex retries in iteration 473: 0
[2025-11-13 11:01:49,329][__main__][INFO] - agents played in iteration 473 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:01:49,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:49,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:49,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:49,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:49,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:49,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:50,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:51,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:51,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:51,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:52,222][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:52,548][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:52,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:53,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:53,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:54,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:54,499][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:54,826][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:55,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:55,476][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:57,103][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:57,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:57,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:58,404][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:59,064][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:59,391][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:59,719][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:00,049][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:00,379][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:00,705][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:01,030][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:01,714][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:02,433][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:02,435][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:02,436][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:03,436][__main__][INFO] - Iteration 474 took 23s (39.19% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 21m 51s. Estimated total time: 19h 20m 6s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 21s.
[2025-11-13 11:02:03,438][__main__][INFO] - Starting iteration 474.
[2025-11-13 11:02:03,441][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:03,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:12,881][__main__][INFO] - Number of regex retries in iteration 474: 0
[2025-11-13 11:02:12,881][__main__][INFO] - agents played in iteration 474 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:02:13,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:13,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:13,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:13,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:13,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:13,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:14,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:14,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:14,811][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:15,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:15,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:15,788][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:16,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:16,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:16,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:17,092][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:17,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:17,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:18,067][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:18,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:18,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:19,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:19,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:20,017][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:20,341][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:20,992][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:21,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:21,973][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:22,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:22,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:23,605][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:23,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:24,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:25,269][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:25,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:26,001][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:26,002][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:27,126][__main__][INFO] - Iteration 475 took 23s (39.85% Gen, 55.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 45m 38s. Estimated total time: 19h 44m 16s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 22s.
[2025-11-13 11:02:27,128][__main__][INFO] - Starting iteration 475.
[2025-11-13 11:02:27,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:27,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:36,395][__main__][INFO] - Number of regex retries in iteration 475: 0
[2025-11-13 11:02:36,395][__main__][INFO] - agents played in iteration 475 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:02:36,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:36,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:36,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:36,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:36,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:36,947][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:37,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:38,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:38,337][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:38,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:38,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:39,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:39,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:40,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:40,943][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:41,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:41,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:41,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:42,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:42,572][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:43,223][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:43,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:43,875][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:44,861][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:45,185][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:45,511][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:45,836][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:46,165][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:46,496][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:46,825][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:47,150][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:47,807][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:48,131][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:48,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:49,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:49,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:49,553][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:50,543][__main__][INFO] - Iteration 476 took 23s (39.28% Gen, 56.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 31m 34s. Estimated total time: 19h 30m 35s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 5s.
[2025-11-13 11:02:50,545][__main__][INFO] - Starting iteration 476.
[2025-11-13 11:02:50,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:50,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:59,778][__main__][INFO] - Number of regex retries in iteration 476: 0
[2025-11-13 11:02:59,779][__main__][INFO] - agents played in iteration 476 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:03:00,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:00,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:00,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:00,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:00,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:00,332][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:01,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:01,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:01,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:02,041][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:02,366][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:02,691][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:03,017][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:03,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:03,994][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:04,320][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:04,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:05,296][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:05,621][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:06,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:07,255][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:07,581][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:08,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:08,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:09,207][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:09,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:09,858][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:10,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:10,512][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:10,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:11,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:12,180][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:12,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:12,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:12,904][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:03:13,866][__main__][INFO] - Iteration 477 took 23s (39.58% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 26m 31s. Estimated total time: 19h 25m 55s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 19s.
[2025-11-13 11:03:13,868][__main__][INFO] - Starting iteration 477.
[2025-11-13 11:03:13,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:03:13,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:03:23,585][__main__][INFO] - Number of regex retries in iteration 477: 0
[2025-11-13 11:03:23,586][__main__][INFO] - agents played in iteration 477 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:03:24,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:24,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:24,144][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:24,891][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:25,187][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:26,163][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:27,139][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:27,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:27,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:28,115][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:28,441][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:29,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:29,743][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:30,067][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:30,392][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:31,376][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:31,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:32,033][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:32,359][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:32,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:33,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:34,322][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:34,646][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:34,971][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:35,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:36,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:36,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:36,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:36,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:03:37,740][__main__][INFO] - Iteration 478 took 23s (40.70% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 53m 40s. Estimated total time: 19h 53m 29s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 54s.
[2025-11-13 11:03:37,742][__main__][INFO] - Starting iteration 478.
[2025-11-13 11:03:37,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:03:37,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:06:51,641][mllm.models.large_language_model_local][INFO] - Loaded 47 past agent adapters from checkpoints directory.
[2025-11-13 11:07:10,399][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'.
[2025-11-13 11:07:11,822][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'.
[2025-11-13 11:07:11,830][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'.
[2025-11-13 11:07:13,110][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'.
[2025-11-13 11:09:23,125][mllm.training.trainer_common][INFO] - Loading trainer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:09:23,128][mllm.training.trainer_common][INFO] - Loading policy optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:09:23,834][mllm.training.trainer_common][INFO] - Loading critic optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:09:23,837][__main__][INFO] - Starting iteration 478.
[2025-11-13 11:09:23,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:09:23,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:09:52,138][__main__][INFO] - Number of regex retries in iteration 478: 0
[2025-11-13 11:09:52,139][__main__][INFO] - agents played in iteration 478 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:09:52,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:52,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:52,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:52,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:52,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:09:52,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:09:53,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:09:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:09:54,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:09:54,741][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:09:55,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:09:55,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:09:55,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:09:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:09:56,379][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:09:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:09:57,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:09:57,362][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:09:57,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:09:58,013][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:09:58,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:09:58,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:09:58,993][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:09:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:09:59,640][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:09:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:10:00,290][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:10:00,612][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:10:00,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:10:01,265][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:10:01,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:10:01,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:10:02,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:10:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:10:02,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:10:03,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:10:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:10:03,850][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:10:04,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:10:04,880][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.78%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:10:05,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:10:05,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:10:05,779][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:10:06,860][__main__][INFO] - Iteration 479 took 43s (65.76% Gen, 31.72% Train). Generation: 28s, Training: 13s. Estimated remaining time: 35h 46m 48s. Estimated total time: 35h 50m 5s. Time estimates for 10 more iterations: 7m 10s, 100 more iterations: 1h 11m 40s, 500 more iterations: 5h 58m 20s.
[2025-11-13 11:10:06,862][__main__][INFO] - Starting iteration 479.
[2025-11-13 11:10:06,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:10:06,866][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:10:25,615][__main__][INFO] - Number of regex retries in iteration 479: 0
[2025-11-13 11:10:25,616][__main__][INFO] - agents played in iteration 479 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:10:26,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:10:26,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:10:26,854][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:10:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:10:27,473][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:10:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:10:28,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:10:28,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:10:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:10:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:10:29,427][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:10:29,753][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:10:30,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:10:30,416][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:10:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:10:31,063][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:10:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:10:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:10:32,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:10:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:10:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:10:33,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:10:33,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:10:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:10:34,006][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:10:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:10:34,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:10:34,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:10:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:10:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:10:35,974][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:10:36,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:10:36,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:10:36,956][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:10:37,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:10:37,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:10:38,713][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:10:38,715][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:10:38,718][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:10:39,622][__main__][INFO] - Iteration 480 took 32s (57.24% Gen, 40.00% Train). Generation: 18s, Training: 13s. Estimated remaining time: 27h 14m 4s. Estimated total time: 27h 17m 54s. Time estimates for 10 more iterations: 5m 27s, 100 more iterations: 54m 35s, 500 more iterations: 4h 32m 59s.
[2025-11-13 11:10:39,625][__main__][INFO] - Starting iteration 480.
[2025-11-13 11:10:39,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:10:39,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:10:53,595][__main__][INFO] - Number of regex retries in iteration 480: 0
[2025-11-13 11:10:53,595][__main__][INFO] - agents played in iteration 480 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:10:54,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:54,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:54,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:54,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:54,160][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:10:54,160][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:10:54,812][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:10:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:10:55,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:10:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:10:56,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:10:56,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:10:56,735][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:10:57,062][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:10:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:10:57,708][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:10:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:10:58,359][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:10:58,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:10:59,007][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:10:59,335][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:10:59,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:10:59,979][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:11:00,303][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:11:00,626][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:11:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:11:01,275][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:11:01,599][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:11:01,922][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:11:02,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:11:02,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:11:02,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:11:03,248][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:11:03,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:11:03,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:11:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:11:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:11:04,880][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:11:05,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:11:05,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:11:06,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:11:06,663][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:11:06,665][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:11:08,404][__main__][INFO] - Iteration 481 took 28s (48.53% Gen, 45.42% Train). Generation: 13s, Training: 13s. Estimated remaining time: 23h 54m 29s. Estimated total time: 23h 58m 47s. Time estimates for 10 more iterations: 4m 47s, 100 more iterations: 47m 57s, 500 more iterations: 3h 59m 47s.
[2025-11-13 11:11:08,405][__main__][INFO] - Starting iteration 481.
[2025-11-13 11:11:08,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:11:08,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:11:19,356][__main__][INFO] - Number of regex retries in iteration 481: 0
[2025-11-13 11:11:19,357][__main__][INFO] - agents played in iteration 481 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:11:19,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,905][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:11:19,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:11:20,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:11:20,902][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:11:21,225][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:11:21,548][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:11:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:11:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:11:22,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:11:22,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:11:23,178][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:11:23,501][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:11:23,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:11:24,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:11:24,473][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:11:24,802][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:11:25,126][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:11:25,449][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:11:25,772][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:11:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:11:26,419][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:11:26,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:11:27,065][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:11:27,389][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:11:27,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:11:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:11:28,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:11:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:11:29,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:11:29,338][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:11:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:11:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:11:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:11:30,645][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:11:30,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:11:31,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:11:32,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:11:32,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:11:32,356][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:11:33,185][__main__][INFO] - Iteration 482 took 24s (44.18% Gen, 52.47% Train). Generation: 10s, Training: 12s. Estimated remaining time: 20h 34m 10s. Estimated total time: 20h 38m 54s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 17s, 500 more iterations: 3h 26m 29s.
[2025-11-13 11:11:33,187][__main__][INFO] - Starting iteration 482.
[2025-11-13 11:11:33,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:11:33,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:11:43,094][__main__][INFO] - Number of regex retries in iteration 482: 0
[2025-11-13 11:11:43,094][__main__][INFO] - agents played in iteration 482 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:11:43,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:43,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:43,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:43,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:43,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:11:43,638][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:11:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:11:44,594][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:11:44,919][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:11:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:11:45,570][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:11:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:11:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:11:46,546][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:11:46,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:11:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:11:47,516][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:11:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:11:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:11:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:11:48,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:11:49,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:11:49,455][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:11:49,779][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:11:50,104][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:11:50,427][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:11:50,750][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:11:51,073][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:11:51,397][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:11:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:11:52,049][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:11:52,373][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:11:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:11:53,019][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:11:53,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:11:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:11:53,994][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:11:54,319][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:11:54,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:11:55,301][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:11:56,011][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:11:56,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:11:56,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:11:56,830][__main__][INFO] - Iteration 483 took 23s (41.89% Gen, 54.65% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 36m 56s. Estimated total time: 19h 42m 3s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 0s.
[2025-11-13 11:11:56,832][__main__][INFO] - Starting iteration 483.
[2025-11-13 11:11:56,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:11:56,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:06,279][__main__][INFO] - Number of regex retries in iteration 483: 0
[2025-11-13 11:12:06,279][__main__][INFO] - agents played in iteration 483 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:12:06,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:06,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:06,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:06,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:06,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:06,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:07,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:12:08,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:12:08,477][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:12:08,804][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:12:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:12:09,467][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:12:09,788][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:12:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:12:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:12:10,774][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:12:11,103][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:12:11,437][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:12:11,764][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:12:12,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:12:12,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:12:12,742][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:12:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:12:13,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:12:13,721][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:12:14,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:12:14,374][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:12:14,699][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:12:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:12:15,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:12:15,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:12:15,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:12:16,323][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:12:16,648][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:12:16,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:12:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:12:17,639][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:12:17,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:12:18,627][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:12:19,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:12:19,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:12:19,352][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:12:20,323][__main__][INFO] - Iteration 484 took 23s (40.20% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 28m 55s. Estimated total time: 19h 34m 25s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 44s.
[2025-11-13 11:12:20,325][__main__][INFO] - Starting iteration 484.
[2025-11-13 11:12:20,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:12:20,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:29,929][__main__][INFO] - Number of regex retries in iteration 484: 0
[2025-11-13 11:12:29,930][__main__][INFO] - agents played in iteration 484 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:12:30,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:30,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:30,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:30,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:30,489][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:30,489][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:31,204][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:12:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:12:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:12:32,487][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:12:32,812][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:12:33,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:12:33,461][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:12:33,785][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:12:34,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:12:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:12:34,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:12:35,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:12:35,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:12:35,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:12:36,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:12:36,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:12:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:12:37,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:12:37,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:12:37,686][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:12:38,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:12:38,332][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:12:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:12:38,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:12:39,303][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:12:39,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:12:39,951][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:12:40,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:12:40,598][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:12:40,923][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:12:41,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:12:41,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:12:42,221][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:12:42,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:12:42,940][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:12:42,941][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:12:43,803][__main__][INFO] - Iteration 485 took 23s (40.89% Gen, 55.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 27m 54s. Estimated total time: 19h 33m 48s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 38s.
[2025-11-13 11:12:43,806][__main__][INFO] - Starting iteration 485.
[2025-11-13 11:12:43,809][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:12:43,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:53,536][__main__][INFO] - Number of regex retries in iteration 485: 0
[2025-11-13 11:12:53,537][__main__][INFO] - agents played in iteration 485 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:12:53,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,095][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:54,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:12:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:12:55,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:12:56,087][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:12:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:12:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:12:57,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:12:57,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:12:57,738][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:12:58,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:12:58,391][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:12:58,716][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:12:59,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:12:59,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:12:59,701][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:00,027][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:00,351][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:00,673][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:01,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:01,647][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:01,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:02,297][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:02,622][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:02,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:03,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:03,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:03,935][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:04,267][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:04,585][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:04,909][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:05,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:05,892][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:06,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:06,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:06,627][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:07,593][__main__][INFO] - Iteration 486 took 23s (40.90% Gen, 55.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 42m 56s. Estimated total time: 19h 49m 14s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 12s.
[2025-11-13 11:13:07,595][__main__][INFO] - Starting iteration 486.
[2025-11-13 11:13:07,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:07,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:13:16,768][__main__][INFO] - Number of regex retries in iteration 486: 0
[2025-11-13 11:13:16,768][__main__][INFO] - agents played in iteration 486 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:13:17,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:17,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:17,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:17,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:17,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:13:17,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:13:18,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:13:18,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:13:18,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:13:19,004][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:13:19,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:13:19,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:13:19,978][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:13:20,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:13:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:13:20,958][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:13:21,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:13:21,603][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:13:21,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:13:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:22,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:22,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:23,582][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:24,238][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:24,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:25,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:25,876][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:26,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:26,851][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:27,176][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:27,507][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:27,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:28,157][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:28,481][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:29,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:29,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:29,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:29,886][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:30,781][__main__][INFO] - Iteration 487 took 23s (39.55% Gen, 56.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 12m 30s. Estimated total time: 19h 19m 11s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 11s.
[2025-11-13 11:13:30,783][__main__][INFO] - Starting iteration 487.
[2025-11-13 11:13:30,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:30,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:13:40,470][__main__][INFO] - Number of regex retries in iteration 487: 0
[2025-11-13 11:13:40,471][__main__][INFO] - agents played in iteration 487 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:13:40,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:40,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:40,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:41,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:41,017][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:13:41,018][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:13:41,746][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:13:42,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:13:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:13:42,691][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:13:43,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:13:43,341][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:13:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:13:43,992][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:13:44,317][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:13:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:13:44,964][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:13:45,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:13:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:13:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:46,273][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:47,250][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:47,573][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:47,900][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:48,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:48,552][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:49,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:49,530][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:49,853][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:50,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:51,152][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:51,479][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:51,802][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:52,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:52,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:53,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:53,504][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:53,506][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:54,396][__main__][INFO] - Iteration 488 took 23s (41.01% Gen, 55.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 33m 27s. Estimated total time: 19h 40m 32s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 45s.
[2025-11-13 11:13:54,398][__main__][INFO] - Starting iteration 488.
[2025-11-13 11:13:54,401][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:54,402][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:03,029][__main__][INFO] - Number of regex retries in iteration 488: 0
[2025-11-13 11:14:03,030][__main__][INFO] - agents played in iteration 488 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:14:03,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:03,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:03,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:03,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:03,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:03,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:04,293][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:04,590][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:04,919][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:05,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:05,570][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:05,899][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:06,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:06,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:06,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:07,232][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:07,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:07,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:08,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:08,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:09,211][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:09,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:10,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:11,168][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:11,819][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:14:12,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:14:12,475][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:14:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:14:13,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:14:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:14:13,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:14:14,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:14:14,420][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:14:14,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:15,407][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:14:16,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:14:16,121][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:14:16,124][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:14:17,026][__main__][INFO] - Iteration 489 took 22s (38.13% Gen, 57.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 43m 50s. Estimated total time: 18h 51m 18s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 42s, 500 more iterations: 3h 8m 33s.
[2025-11-13 11:14:17,028][__main__][INFO] - Starting iteration 489.
[2025-11-13 11:14:17,032][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:14:17,033][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:26,543][__main__][INFO] - Number of regex retries in iteration 489: 0
[2025-11-13 11:14:26,544][__main__][INFO] - agents played in iteration 489 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:14:26,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:27,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:27,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:27,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:27,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:27,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:27,832][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:28,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:28,454][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:28,781][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:29,108][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:29,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:29,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:30,418][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:30,744][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:31,073][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:31,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:32,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:32,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:33,026][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:33,678][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:34,010][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:34,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:34,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:14:35,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:14:35,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:14:36,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:14:36,623][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:14:36,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:14:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:14:37,595][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:14:37,920][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:14:38,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:38,887][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:14:39,611][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:14:39,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:14:39,614][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:14:40,493][__main__][INFO] - Iteration 490 took 23s (40.54% Gen, 55.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 25m 14s. Estimated total time: 19h 33m 5s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 30s.
[2025-11-13 11:14:40,495][__main__][INFO] - Starting iteration 490.
[2025-11-13 11:14:40,499][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:14:40,499][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:49,724][__main__][INFO] - Number of regex retries in iteration 490: 0
[2025-11-13 11:14:49,724][__main__][INFO] - agents played in iteration 490 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:14:50,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:50,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:50,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:50,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:50,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:50,278][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:51,006][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:51,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:51,630][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:51,954][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:52,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:52,607][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:52,930][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:53,589][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:54,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:54,558][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:55,208][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:56,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:57,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:57,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:58,151][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:58,475][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:14:58,800][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:14:59,124][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:14:59,448][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:14:59,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:00,424][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:00,748][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:01,072][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:01,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:02,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:02,769][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:02,770][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:02,772][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:04,645][__main__][INFO] - Iteration 491 took 24s (38.21% Gen, 54.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 59m 8s. Estimated total time: 20h 7m 22s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 14s, 500 more iterations: 3h 21m 13s.
[2025-11-13 11:15:04,647][__main__][INFO] - Starting iteration 491.
[2025-11-13 11:15:04,651][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:15:04,651][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:15:13,866][__main__][INFO] - Number of regex retries in iteration 491: 0
[2025-11-13 11:15:13,867][__main__][INFO] - agents played in iteration 491 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:15:14,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:14,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:14,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:14,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:14,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:15:14,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:15:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:15:15,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:15:15,757][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:15:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:15:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:15:16,749][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:15:17,074][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:15:17,399][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:15:17,725][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:15:18,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:15:18,374][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:15:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:15:19,025][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:15:19,351][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:15:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:15:20,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:15:20,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:15:20,658][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:15:20,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:15:21,312][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:15:21,644][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:15:21,967][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:15:22,298][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:15:22,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:22,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:23,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:23,607][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:23,933][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:24,257][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:24,582][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:24,908][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:25,232][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:25,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:26,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:26,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:26,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:26,943][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:27,891][__main__][INFO] - Iteration 492 took 23s (39.65% Gen, 56.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 13m 23s. Estimated total time: 19h 22m 2s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 40s.
[2025-11-13 11:15:27,893][__main__][INFO] - Starting iteration 492.
[2025-11-13 11:15:27,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:15:27,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:15:37,115][__main__][INFO] - Number of regex retries in iteration 492: 0
[2025-11-13 11:15:37,115][__main__][INFO] - agents played in iteration 492 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:15:37,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:37,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:37,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:37,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:37,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:15:37,670][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:15:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:15:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:15:39,011][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:15:39,335][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:15:39,662][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:15:39,993][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:15:40,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:15:40,650][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:15:40,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:15:41,306][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:15:41,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:15:41,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:15:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:15:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:15:42,938][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:15:43,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:15:43,594][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:15:43,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:15:44,249][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:15:44,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:15:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:15:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:15:45,565][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:15:45,892][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:46,219][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:46,544][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:46,871][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:47,195][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:47,522][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:47,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:48,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:48,519][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:48,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:49,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:50,191][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:50,192][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:50,194][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:51,079][__main__][INFO] - Iteration 493 took 23s (39.76% Gen, 56.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 10m 9s. Estimated total time: 19h 19m 10s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 11s.
[2025-11-13 11:15:51,081][__main__][INFO] - Starting iteration 493.
[2025-11-13 11:15:51,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:15:51,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:15:59,121][__main__][INFO] - Number of regex retries in iteration 493: 0
[2025-11-13 11:15:59,122][__main__][INFO] - agents played in iteration 493 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:15:59,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:59,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:59,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:59,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:59,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:15:59,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:16:00,429][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:00,725][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:01,708][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:03,011][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:03,336][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:03,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:03,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:16:04,317][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:16:04,642][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:16:04,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:16:05,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:16:05,622][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:16:05,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:16:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:16:06,598][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:16:06,923][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:16:07,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:16:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:16:07,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:16:08,238][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:16:08,564][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:16:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:16:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:16:09,543][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:16:09,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:16:10,194][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:16:10,518][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:16:10,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:16:11,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:12,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:12,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:12,225][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:13,217][__main__][INFO] - Iteration 494 took 22s (36.31% Gen, 59.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 17s. Estimated total time: 18h 26m 41s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 53s, 500 more iterations: 3h 4m 26s.
[2025-11-13 11:16:13,219][__main__][INFO] - Starting iteration 494.
[2025-11-13 11:16:13,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:13,223][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:16:22,026][__main__][INFO] - Number of regex retries in iteration 494: 0 [2025-11-13 11:16:22,027][__main__][INFO] - agents played in iteration 494 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:16:22,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:22,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:22,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:22,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:22,592][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:16:22,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:16:23,301][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:23,604][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:23,922][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:24,247][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:24,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:25,227][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:25,552][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:25,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:26,211][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:26,533][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:26,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:16:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:16:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:16:27,845][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:16:28,171][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:16:28,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:16:28,823][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:16:29,147][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:16:29,471][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:16:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:16:30,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:16:30,456][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:16:30,781][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:16:31,107][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:16:31,435][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:16:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:16:32,095][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:16:32,423][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:16:32,753][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:16:33,083][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:16:33,411][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:16:33,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:16:34,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:35,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:35,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:35,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:36,039][__main__][INFO] - Iteration 495 took 22s (38.58% Gen, 57.46% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 51m 5s. Estimated total time: 19h 0m 52s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 1s, 500 more iterations: 3h 10m 8s.
[2025-11-13 11:16:36,041][__main__][INFO] - Starting iteration 495.
[2025-11-13 11:16:36,045][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:36,045][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:16:44,238][__main__][INFO] - Number of regex retries in iteration 495: 0
[2025-11-13 11:16:44,238][__main__][INFO] - agents played in iteration 495 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:16:44,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:44,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:44,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:44,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:44,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:16:44,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:16:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:45,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:46,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:46,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:47,134][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:47,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:47,784][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:48,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:48,760][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:49,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:16:49,412][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:16:49,739][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:16:50,065][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:16:50,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:16:50,719][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:16:51,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:16:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:16:51,706][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:16:52,035][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:16:52,361][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:16:52,692][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:16:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:16:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:16:53,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:16:54,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:16:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:16:54,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:16:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:16:55,326][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:16:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:16:55,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:16:56,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:57,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:57,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:57,388][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:58,269][__main__][INFO] - Iteration 496 took 22s (36.87% Gen, 59.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 21m 5s. Estimated total time: 18h 31m 14s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 2s, 500 more iterations: 3h 5m 12s.
[2025-11-13 11:16:58,271][__main__][INFO] - Starting iteration 496.
[2025-11-13 11:16:58,274][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:58,274][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:17:06,827][__main__][INFO] - Number of regex retries in iteration 496: 0
[2025-11-13 11:17:06,827][__main__][INFO] - agents played in iteration 496 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:17:07,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:07,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:07,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:07,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:07,414][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:17:07,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:17:08,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:17:08,440][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:17:08,767][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:17:09,092][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:17:09,418][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:17:09,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:17:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:17:10,410][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:17:10,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:17:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:17:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:17:11,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:12,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:12,370][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:12,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:13,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:13,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:13,999][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:14,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:14,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:15,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:15,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:16,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:17,278][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:17,607][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:17,939][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:18,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:19,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:20,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:20,009][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:20,011][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:20,909][__main__][INFO] - Iteration 497 took 22s (37.78% Gen, 58.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 41m 18s. Estimated total time: 18h 51m 49s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 38s.
[2025-11-13 11:17:20,911][__main__][INFO] - Starting iteration 497.
[2025-11-13 11:17:20,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:20,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:17:29,262][__main__][INFO] - Number of regex retries in iteration 497: 0
[2025-11-13 11:17:29,263][__main__][INFO] - agents played in iteration 497 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:17:29,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:29,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:29,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:29,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:29,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:17:29,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:17:30,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:17:30,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:17:31,184][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:17:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:17:31,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:17:32,162][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:17:32,488][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:17:32,816][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:17:33,143][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:17:33,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:17:33,794][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:17:34,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:35,425][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:35,751][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:36,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:36,403][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:36,729][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:37,055][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:37,380][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:38,360][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:38,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:39,016][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:39,340][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:39,667][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:40,317][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:40,645][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:40,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:41,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:42,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:42,385][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:42,387][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:43,323][__main__][INFO] - Iteration 498 took 22s (37.25% Gen, 58.56% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 29m 34s. Estimated total time: 18h 40m 27s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 20s, 500 more iterations: 3h 6m 44s.
[2025-11-13 11:17:43,326][__main__][INFO] - Starting iteration 498.
[2025-11-13 11:17:43,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:43,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:17:51,218][__main__][INFO] - Number of regex retries in iteration 498: 0
[2025-11-13 11:17:51,218][__main__][INFO] - agents played in iteration 498 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:17:51,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:51,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:51,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:51,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:51,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:17:51,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:17:52,501][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:17:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:17:53,122][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:17:53,449][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:17:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:17:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:17:54,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:17:54,755][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:17:55,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:17:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:17:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:17:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:56,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:57,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:57,368][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:57,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:58,357][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:59,009][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:59,335][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:01,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:01,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:01,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:02,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:18:03,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:18:04,345][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:18:04,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:18:04,349][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:18:05,220][__main__][INFO] - Iteration 499 took 21s (36.03% Gen, 59.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 3m 21s. Estimated total time: 18h 14m 36s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 29s, 500 more iterations: 3h 2m 26s.
[2025-11-13 11:18:05,223][__main__][INFO] - Starting iteration 499.
[2025-11-13 11:18:05,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:18:05,227][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:13,871][__main__][INFO] - Number of regex retries in iteration 499: 0
[2025-11-13 11:18:13,871][__main__][INFO] - agents played in iteration 499 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:18:14,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:14,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:14,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:14,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:14,416][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:14,416][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:18:15,109][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:15,738][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:16,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:16,384][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:17,034][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:17,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:17,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:18,010][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:18,989][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:19,314][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:19,645][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:19,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:20,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:20,946][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:21,267][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:21,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:21,918][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:18:22,249][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:18:22,570][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:22,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:23,222][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:23,548][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:23,874][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:24,201][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:24,526][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:24,853][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:25,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:18:26,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:18:26,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:18:26,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:18:26,925][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:18:27,845][__main__][INFO] - Iteration 500 took 22s (38.21% Gen, 57.71% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 39m 22s. Estimated total time: 18h 51m 0s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 42s, 500 more iterations: 3h 8m 30s.
[2025-11-13 11:18:27,848][__main__][INFO] - Starting iteration 500.
[2025-11-13 11:18:27,851][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:18:27,851][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:35,997][__main__][INFO] - Number of regex retries in iteration 500: 0
[2025-11-13 11:18:35,998][__main__][INFO] - agents played in iteration 500 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:18:36,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:36,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:36,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:36,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:36,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:36,550][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:18:37,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:37,881][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:38,538][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:38,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:39,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:39,841][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:40,167][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:40,493][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:40,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:41,145][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:41,470][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:41,797][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:42,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:42,454][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:43,433][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:43,759][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:44,084][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:18:44,412][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:18:44,745][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:45,071][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:45,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:45,723][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:46,048][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:46,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:47,026][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:47,352][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:47,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:18:48,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:18:49,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:18:49,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:18:49,107][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:18:50,836][__main__][INFO] - Iteration 501 took 22s (35.44% Gen, 57.03% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 57m 18s. Estimated total time: 19h 9m 19s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 33s.
[2025-11-13 11:18:50,838][__main__][INFO] - Starting iteration 501.
[2025-11-13 11:18:50,842][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:18:50,842][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:58,976][__main__][INFO] - Number of regex retries in iteration 501: 0
[2025-11-13 11:18:58,977][__main__][INFO] - agents played in iteration 501 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:18:59,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:59,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:59,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:59,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:59,547][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:59,547][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:00,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:00,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:00,867][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:01,524][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:01,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:02,178][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:02,828][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:03,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:03,807][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:04,133][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:04,459][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:04,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:05,437][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:05,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:06,088][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:06,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:07,069][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:07,394][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:08,046][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:08,372][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:09,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:09,351][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:10,002][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:10,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:10,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:11,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:12,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:12,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:12,071][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:12,957][__main__][INFO] - Iteration 502 took 22s (36.78% Gen, 59.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 13m 25s. Estimated total time: 18h 25m 48s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 51s, 500 more iterations: 3h 4m 18s.
[2025-11-13 11:19:12,959][__main__][INFO] - Starting iteration 502.
[2025-11-13 11:19:12,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:19:12,963][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:21,421][__main__][INFO] - Number of regex retries in iteration 502: 0
[2025-11-13 11:19:21,421][__main__][INFO] - agents played in iteration 502 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:19:21,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:21,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:21,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:21,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:21,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:21,985][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:22,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:23,003][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:23,329][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:23,988][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:24,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:24,639][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:24,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:25,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:25,939][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:26,264][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:26,590][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:26,924][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:27,251][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:27,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:28,562][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:28,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:29,875][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:30,201][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:30,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:32,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:32,481][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:33,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:33,854][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:34,571][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:34,572][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:34,573][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:35,391][__main__][INFO] - Iteration 503 took 22s (37.71% Gen, 58.64% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 28m 42s. Estimated total time: 18h 41m 28s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 22s, 500 more iterations: 3h 6m 54s.
[2025-11-13 11:19:35,393][__main__][INFO] - Starting iteration 503.
[2025-11-13 11:19:35,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:19:35,397][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:44,215][__main__][INFO] - Number of regex retries in iteration 503: 0
[2025-11-13 11:19:44,216][__main__][INFO] - agents played in iteration 503 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:19:44,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:44,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:44,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:44,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:44,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:44,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:45,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:45,755][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:46,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:46,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:47,058][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:47,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:47,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:48,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:48,375][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:48,700][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:49,031][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:49,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:49,685][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:50,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:50,662][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:50,987][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:51,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:51,964][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:52,617][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:53,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:54,255][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:54,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:55,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:55,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:56,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:57,282][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:57,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:57,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:58,124][__main__][INFO] - Iteration 504 took 22s (38.80% Gen, 57.50% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 43m 19s. Estimated total time: 18h 56m 28s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 52s, 500 more iterations: 3h 9m 24s.
[2025-11-13 11:19:58,126][__main__][INFO] - Starting iteration 504.
[2025-11-13 11:19:58,129][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:19:58,130][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:07,431][__main__][INFO] - Number of regex retries in iteration 504: 0
[2025-11-13 11:20:07,433][__main__][INFO] - agents played in iteration 504 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:20:07,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:07,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:07,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:07,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:07,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:07,989][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:08,718][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:09,027][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:09,352][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:10,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:10,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:10,664][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:10,988][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:11,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:11,640][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:11,966][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:12,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:12,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:13,269][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:14,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:14,900][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:15,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:15,553][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:15,879][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:16,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:17,185][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:17,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:17,837][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:18,488][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:18,814][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:19,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:19,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:20,560][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:20,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:20,563][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:21,537][__main__][INFO] - Iteration 505 took 23s (39.75% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 16m 55s. Estimated total time: 19h 30m 27s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 4s.
[2025-11-13 11:20:21,540][__main__][INFO] - Starting iteration 505.
[2025-11-13 11:20:21,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:20:21,543][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:30,910][__main__][INFO] - Number of regex retries in iteration 505: 0
[2025-11-13 11:20:30,911][__main__][INFO] - agents played in iteration 505 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:20:31,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:31,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:31,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:31,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:31,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:31,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
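The "For task: ..." lines above report memory usage around a block of work as three percentages of device VRAM: the change across the block, the current allocation, and the block's peak. A minimal sketch of how such a line could be formatted (`format_vram_line` and its byte-count arguments are illustrative, not the repo's actual API; on a CUDA device the inputs would typically come from `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()`):

```python
def format_vram_line(task, before, after, peak, total, dt="00:00:00"):
    """Format a VRAM report line in the style seen in this log.

    before/after/peak/total are byte counts on the same device;
    all percentages are taken relative to `total`.
    """
    delta_pct = (after - before) / total * 100
    current_pct = after / total * 100
    peak_pct = peak / total * 100
    return (
        f"For task: {task}, ΔVRAM % (total): {delta_pct:.2f}%, "
        f"Current % of VRAM taken: {current_pct:.2f}%, "
        f"Block Peak % of device VRAM: {peak_pct:.2f}%, ΔTime: {dt}"
    )

# Plain integers stand in for real byte counts here.
line = format_vram_line(
    "Get advantages with critic gradient accumulation",
    before=40, after=40, peak=19, total=100,
)
```

With equal `before` and `after` the delta is 0.00%, matching the repeated advantage-computation lines above where the block allocates nothing net.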
[2025-11-13 11:20:32,193][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:32,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:32,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:33,475][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:34,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:34,472][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:34,797][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:35,122][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:35,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:36,430][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:37,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:37,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:38,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:38,714][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:39,695][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:40,020][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:40,359][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:40,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:41,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:41,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:41,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:41,983][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:42,309][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:42,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
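Each update walks 128 mini-batches, accumulating the policy-gradient loss over every action token before a single "Apply reinforce step". A hedged pure-Python sketch of that bookkeeping (the real trainer operates on torch tensors; the 30 tokens per mini-batch here is an assumption chosen only so that 128 × 30 = 3840, the token count logged above):

```python
def accumulate_policy_loss(minibatches, log_every=4):
    """Sum per-token REINFORCE loss over a sequence of mini-batches.

    Each mini-batch is a list of (logprob, advantage) pairs, one per
    action token; each token contributes -logprob * advantage.
    Returns (total_loss, token_count), mirroring the
    "Accumulated the policy gradient loss for N tokens" line.
    """
    total_loss, token_count = 0.0, 0
    for i, mb in enumerate(minibatches):
        if i % log_every == 0:
            # The log prints progress every 4th mini-batch (0, 4, 8, ...).
            print(f"Processing mini-batch {i} of {len(minibatches)}")
        for logprob, adv in mb:
            total_loss += -logprob * adv
            token_count += 1
    print(f"Accumulated the policy gradient loss for {token_count} tokens.")
    return total_loss, token_count

# 128 mini-batches of 30 tokens each -> 3840 tokens, as in the log.
batches = [[(-0.5, 1.0)] * 30 for _ in range(128)]
loss, n = accumulate_policy_loss(batches)
```

Accumulating across all mini-batches before one optimizer step is what keeps the "Apply reinforce step" block's peak VRAM bounded by a single mini-batch's activations rather than the full batch of 128.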
[2025-11-13 11:20:43,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:44,049][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:44,050][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:44,052][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:44,849][__main__][INFO] - Iteration 506 took 23s (40.19% Gen, 56.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 11m 28s. Estimated total time: 19h 25m 23s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 13s.
[2025-11-13 11:20:44,852][__main__][INFO] - Starting iteration 506.
[2025-11-13 11:20:44,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:20:44,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:53,812][__main__][INFO] - Number of regex retries in iteration 506: 0
[2025-11-13 11:20:53,812][__main__][INFO] - agents played in iteration 506 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:20:54,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:54,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:54,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:54,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:54,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:54,349][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:55,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:56,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:57,009][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:57,986][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:58,646][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:58,973][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:59,296][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:59,622][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:59,947][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:00,605][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:00,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:01,262][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:01,588][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:02,575][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:02,904][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:03,563][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:03,889][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:04,554][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:04,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:05,207][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:05,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:06,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:06,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:06,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:06,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:07,815][__main__][INFO] - Iteration 507 took 22s (39.01% Gen, 57.36% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 53m 46s. Estimated total time: 19h 8m 4s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 20s.
[2025-11-13 11:21:07,817][__main__][INFO] - Starting iteration 507.
[2025-11-13 11:21:07,820][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:07,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:16,590][__main__][INFO] - Number of regex retries in iteration 507: 0
[2025-11-13 11:21:16,590][__main__][INFO] - agents played in iteration 507 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:21:17,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:17,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:17,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:17,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:17,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:17,143][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:17,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:18,144][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:18,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:18,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:19,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:19,460][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:20,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:21,095][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:21,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:21,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:22,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:24,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:24,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:25,019][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:25,345][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:25,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:26,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:26,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:26,979][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:27,305][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:27,631][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:27,957][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:28,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:28,987][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:29,695][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:29,697][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:29,699][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:30,575][__main__][INFO] - Iteration 508 took 22s (38.54% Gen, 57.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 43m 6s. Estimated total time: 18h 57m 47s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 55s, 500 more iterations: 3h 9m 37s.
[2025-11-13 11:21:30,577][__main__][INFO] - Starting iteration 508.
[2025-11-13 11:21:30,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:30,581][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:40,110][__main__][INFO] - Number of regex retries in iteration 508: 0
[2025-11-13 11:21:40,111][__main__][INFO] - agents played in iteration 508 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:21:40,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:40,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:40,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:40,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:40,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:40,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:41,375][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:41,672][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:41,998][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:42,323][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:42,648][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:42,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:43,299][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:43,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:43,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:44,276][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:44,928][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:45,254][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:45,581][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:45,906][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:46,232][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:46,557][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:46,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:47,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:47,867][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:48,195][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:48,524][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:48,850][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:49,176][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:49,501][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:50,486][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:50,812][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:51,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:51,465][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:51,792][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:52,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:53,187][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:53,189][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:53,191][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:54,060][__main__][INFO] - Iteration 509 took 23s (40.59% Gen, 55.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 57s. Estimated total time: 19h 34m 1s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 40s.
[2025-11-13 11:21:54,062][__main__][INFO] - Starting iteration 509.
[2025-11-13 11:21:54,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:54,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:02,820][__main__][INFO] - Number of regex retries in iteration 509: 0
[2025-11-13 11:22:02,821][__main__][INFO] - agents played in iteration 509 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:22:03,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:03,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:03,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:03,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:03,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:03,379][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:04,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:04,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:04,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:05,046][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:05,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:05,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:06,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:06,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:07,983][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:08,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:08,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:09,287][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:09,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:09,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:10,595][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:10,922][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:11,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:11,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:11,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:12,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:12,565][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:12,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:13,544][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:13,871][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:14,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:15,226][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:15,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:15,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:15,922][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:16,734][__main__][INFO] - Iteration 510 took 22s (38.62% Gen, 57.79% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 38m 3s. Estimated total time: 18h 53m 30s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 47s, 500 more iterations: 3h 8m 55s.
[2025-11-13 11:22:16,736][__main__][INFO] - Starting iteration 510.
[2025-11-13 11:22:16,739][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:16,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:25,728][__main__][INFO] - Number of regex retries in iteration 510: 0
[2025-11-13 11:22:25,728][__main__][INFO] - agents played in iteration 510 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:22:26,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:26,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:26,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:26,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:26,294][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:26,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:26,986][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:27,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:27,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:28,265][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:28,593][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:28,918][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:29,243][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:29,896][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:30,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:30,547][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:31,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:31,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:31,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:32,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:33,165][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:33,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:34,148][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:34,474][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:34,800][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:35,126][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:36,758][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:37,084][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:37,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:38,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:38,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:38,841][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:38,842][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:40,491][__main__][INFO] - Iteration 511 took 23s (37.84% Gen, 55.21% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 31m 47s. Estimated total time: 19h 47m 37s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 56s.
[2025-11-13 11:22:40,493][__main__][INFO] - Starting iteration 511.
[2025-11-13 11:22:40,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:22:40,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:50,232][__main__][INFO] - Number of regex retries in iteration 511: 0
[2025-11-13 11:22:50,233][__main__][INFO] - agents played in iteration 511 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:22:50,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:50,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:50,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:50,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:50,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:50,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:51,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:51,823][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:52,149][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:52,474][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:52,799][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:53,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:53,449][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:53,781][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:55,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:55,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:56,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:57,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:57,709][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:58,042][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:58,367][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:58,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:59,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:59,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:59,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:59,997][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:00,323][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:00,648][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:00,974][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:01,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:01,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:02,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:03,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:03,477][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:03,479][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:04,327][__main__][INFO] - Iteration 512 took 23s (40.85% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 35m 21s. Estimated total time: 19h 51m 36s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 36s.
[2025-11-13 11:23:04,329][__main__][INFO] - Starting iteration 512.
[2025-11-13 11:23:04,332][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:23:04,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:13,268][__main__][INFO] - Number of regex retries in iteration 512: 0
[2025-11-13 11:23:13,268][__main__][INFO] - agents played in iteration 512 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:23:13,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:13,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:13,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:13,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:13,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:13,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:14,559][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:14,868][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:15,194][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:15,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:16,171][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:16,821][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:17,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:17,810][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:18,135][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:18,786][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:23:19,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:23:19,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:23:20,100][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:23:20,428][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:23:20,754][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:23:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:23:21,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:23:21,748][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:23:22,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:23:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:23:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:23:23,048][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:23,374][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:24,027][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:24,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:24,679][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:25,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:25,708][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:26,436][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:26,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:26,439][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:27,225][__main__][INFO] - Iteration 513 took 22s (39.03% Gen, 57.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 48m 3s. Estimated total time: 19h 4m 41s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 46s.
[2025-11-13 11:23:27,227][__main__][INFO] - Starting iteration 513.
[2025-11-13 11:23:27,229][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:23:27,230][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:36,710][__main__][INFO] - Number of regex retries in iteration 513: 0
[2025-11-13 11:23:36,711][__main__][INFO] - agents played in iteration 513 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:23:37,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:37,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:37,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:37,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:37,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:37,267][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:38,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:38,604][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:38,927][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:39,254][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:39,581][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:39,906][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:40,230][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:40,555][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:40,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:41,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:41,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:23:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:23:43,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:23:43,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:23:43,813][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:23:44,137][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:23:44,463][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:23:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:23:45,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:23:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:23:45,767][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:23:46,093][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:23:46,418][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:47,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:47,397][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:48,049][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:48,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:49,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:49,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:49,790][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:49,792][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:50,594][__main__][INFO] - Iteration 514 took 23s (40.58% Gen, 55.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 11m 15s. Estimated total time: 19h 28m 16s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 42s.
[2025-11-13 11:23:50,596][__main__][INFO] - Starting iteration 514.
[2025-11-13 11:23:50,599][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:23:50,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:24:00,070][__main__][INFO] - Number of regex retries in iteration 514: 0
[2025-11-13 11:24:00,071][__main__][INFO] - agents played in iteration 514 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:24:00,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:00,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:00,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:00,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:00,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:24:00,638][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:24:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:24:01,650][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:24:01,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:24:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:24:02,628][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:24:02,954][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:24:03,279][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:24:03,604][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:24:03,929][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:24:04,254][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:24:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:24:04,906][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:24:05,233][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:24:05,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:24:05,887][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:06,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:07,191][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:07,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:08,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:08,507][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:08,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:09,152][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:09,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:09,804][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:10,132][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:10,458][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:11,435][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:11,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:12,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:13,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:13,165][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:13,166][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:14,028][__main__][INFO] - Iteration 515 took 23s (40.42% Gen, 55.89% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 14m 7s. Estimated total time: 19h 31m 31s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 15s.
[2025-11-13 11:24:14,031][__main__][INFO] - Starting iteration 515.
[2025-11-13 11:24:14,034][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:14,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:24:22,912][__main__][INFO] - Number of regex retries in iteration 515: 0
[2025-11-13 11:24:22,912][__main__][INFO] - agents played in iteration 515 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:24:23,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:23,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:23,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:23,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:23,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:24:23,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:24:24,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:24:24,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:24:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:24:25,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:24:25,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:24:25,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:24:26,112][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:24:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:24:26,765][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:24:27,089][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:24:27,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:24:27,741][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:24:28,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:24:28,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:24:28,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:29,051][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:29,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:30,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:30,703][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:31,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:31,357][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:31,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:32,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:32,340][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:32,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:33,647][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:33,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:34,296][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:34,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:35,334][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:36,021][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:36,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:36,024][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:36,974][__main__][INFO] - Iteration 516 took 22s (38.70% Gen, 57.16% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 49m 15s. Estimated total time: 19h 7m 2s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 10s.
[2025-11-13 11:24:36,976][__main__][INFO] - Starting iteration 516.
[2025-11-13 11:24:36,979][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:36,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:24:46,227][__main__][INFO] - Number of regex retries in iteration 516: 0 [2025-11-13 11:24:46,228][__main__][INFO] - agents played in iteration 516 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:24:46,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:46,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:46,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:46,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:46,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:24:46,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:24:47,509][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:24:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:24:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:24:48,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:24:48,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:24:49,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:24:49,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:24:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:24:50,086][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:24:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:24:50,738][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:24:51,064][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:24:51,391][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:24:51,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:24:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:24:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:24:52,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:24:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:24:53,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:24:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:24:53,998][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:24:54,324][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:24:54,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:24:54,976][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:24:55,302][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:24:55,629][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:24:55,955][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:24:56,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:24:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:24:56,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:24:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:24:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:24:57,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:24:58,613][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:24:59,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:24:59,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:24:59,320][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:00,190][__main__][INFO] - Iteration 517 took 23s (39.84% Gen, 56.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 2m 25s. Estimated total time: 19h 20m 35s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 25s. [2025-11-13 11:25:00,191][__main__][INFO] - Starting iteration 517. [2025-11-13 11:25:00,195][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:25:00,195][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:09,111][__main__][INFO] - Number of regex retries in iteration 517: 0 [2025-11-13 11:25:09,112][__main__][INFO] - agents played in iteration 517 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:25:09,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:09,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:09,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:09,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:09,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:09,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:25:10,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:25:10,700][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:25:11,035][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:25:11,363][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:25:11,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:25:12,013][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:25:12,340][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:25:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:25:12,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:25:13,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:25:13,650][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:25:13,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:25:14,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:14,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:14,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:15,286][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:15,612][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:15,943][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:16,595][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:25:16,921][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:25:17,247][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:17,574][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:17,902][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:18,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:19,212][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:19,538][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:19,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:20,196][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:20,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:25:21,544][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:25:22,266][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:25:22,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:25:22,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:23,252][__main__][INFO] - Iteration 518 took 23s (38.67% Gen, 57.06% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 54m 20s. Estimated total time: 19h 12m 53s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 8s. [2025-11-13 11:25:23,253][__main__][INFO] - Starting iteration 518. [2025-11-13 11:25:23,256][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:25:23,257][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:31,950][__main__][INFO] - Number of regex retries in iteration 518: 0 [2025-11-13 11:25:31,951][__main__][INFO] - agents played in iteration 518 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:25:32,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:32,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:32,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:32,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:32,509][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:32,509][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:25:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:25:33,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:25:33,871][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:25:34,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:25:34,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:25:34,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:25:35,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:25:35,500][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:25:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:25:36,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:25:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:25:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:25:37,135][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:37,459][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:37,784][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:38,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:38,440][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:39,091][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:25:39,749][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:25:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:41,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:42,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:43,008][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:43,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:25:44,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:25:45,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:25:45,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:25:45,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:45,968][__main__][INFO] - Iteration 519 took 22s (38.28% Gen, 57.76% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 36m 42s. Estimated total time: 18h 55m 38s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 51s, 500 more iterations: 3h 9m 16s. [2025-11-13 11:25:45,970][__main__][INFO] - Starting iteration 519. [2025-11-13 11:25:45,974][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:25:45,974][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:55,141][__main__][INFO] - Number of regex retries in iteration 519: 0 [2025-11-13 11:25:55,141][__main__][INFO] - agents played in iteration 519 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:25:55,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:55,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:55,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:55,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:55,722][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:55,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:25:56,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:25:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:25:57,068][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:25:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:25:57,718][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:25:58,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:25:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:25:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:25:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:25:59,354][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:25:59,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:26:00,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:26:00,347][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:26:00,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:26:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:26:01,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:26:01,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:26:01,992][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:26:02,318][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:26:02,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:26:02,972][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:26:03,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:26:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:26:03,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:26:04,276][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:26:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:26:04,930][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:26:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:26:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:26:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:26:06,230][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:26:06,556][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:26:06,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:26:07,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:26:08,303][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:26:08,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:26:08,306][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:26:09,231][__main__][INFO] - Iteration 520 took 23s (39.42% Gen, 56.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 3m 35s. Estimated total time: 19h 22m 54s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 49s. [2025-11-13 11:26:09,233][__main__][INFO] - Starting iteration 520. [2025-11-13 11:26:09,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:26:09,237][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:18,398][__main__][INFO] - Number of regex retries in iteration 520: 0 [2025-11-13 11:26:18,399][__main__][INFO] - agents played in iteration 520 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:26:18,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:18,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:18,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:18,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:18,958][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:18,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:26:19,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:26:20,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:26:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:26:20,662][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:26:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:26:21,326][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:26:21,653][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:26:21,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:26:22,314][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:26:22,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:26:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:26:23,294][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:26:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:26:23,972][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:26:24,300][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:26:24,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:26:24,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:26:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:26:25,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:26:25,942][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:26:26,270][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:26:26,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:26:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:26:27,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:26:27,574][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:26:27,900][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:26:28,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:26:28,552][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:26:28,878][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:26:29,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:26:29,530][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:26:29,858][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:26:30,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:26:30,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:26:31,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:26:31,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:26:31,626][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:26:33,387][__main__][INFO] - Iteration 521 took 24s (37.93% Gen, 54.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 47m 50s. Estimated total time: 20h 7m 34s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 15s. [2025-11-13 11:26:33,389][__main__][INFO] - Starting iteration 521. [2025-11-13 11:26:33,392][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 11:26:33,392][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:26:42,474][__main__][INFO] - Number of regex retries in iteration 521: 0 [2025-11-13 11:26:42,474][__main__][INFO] - agents played in iteration 521 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:26:42,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:42,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:42,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:43,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:26:43,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:26:43,039][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:26:43,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:26:44,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:26:44,392][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:26:44,722][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:26:45,050][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:26:45,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:26:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:26:46,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:26:46,366][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:26:46,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:26:47,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:26:47,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:26:47,673][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:26:47,998][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:26:48,323][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:26:48,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:26:48,977][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:26:49,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:26:49,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:26:49,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:26:50,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:26:50,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:26:50,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:26:51,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:26:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:26:51,914][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:26:52,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:26:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:26:52,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:26:53,219][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:26:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:26:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:26:54,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:26:54,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:26:55,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:26:55,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:26:55,630][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:26:56,489][__main__][INFO] - Iteration 522 took 23s (39.32% Gen, 56.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 54m 49s. Estimated total time: 19h 14m 56s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 29s.
[2025-11-13 11:26:56,492][__main__][INFO] - Starting iteration 522.
[2025-11-13 11:26:56,495][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:26:56,495][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:05,340][__main__][INFO] - Number of regex retries in iteration 522: 0
[2025-11-13 11:27:05,341][__main__][INFO] - agents played in iteration 522 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:27:05,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:05,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:06,639][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:07,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:07,918][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:09,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:09,884][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:10,209][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:10,537][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:11,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:12,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:12,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:13,151][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:13,480][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:13,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:14,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:14,786][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:15,114][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:15,446][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:15,770][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:16,095][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:16,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:16,750][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:17,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:17,780][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:18,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:18,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:18,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:19,426][__main__][INFO] - Iteration 523 took 22s (38.57% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 46m 6s. Estimated total time: 19h 6m 36s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 6s.
[2025-11-13 11:27:19,428][__main__][INFO] - Starting iteration 523.
[2025-11-13 11:27:19,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:27:19,432][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:28,386][__main__][INFO] - Number of regex retries in iteration 523: 0
[2025-11-13 11:27:28,387][__main__][INFO] - agents played in iteration 523 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:27:28,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:28,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:28,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:28,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:28,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:28,942][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:29,680][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:30,302][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:30,630][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:31,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:31,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:32,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:32,586][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:32,912][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:33,239][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:33,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:34,216][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:34,544][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:34,870][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:35,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:35,521][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:36,173][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:36,499][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:36,825][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:37,152][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:37,480][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:38,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:38,456][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:39,433][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:39,759][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:40,085][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:40,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:41,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:41,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:41,486][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:42,375][__main__][INFO] - Iteration 524 took 22s (39.03% Gen, 57.09% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 46m 21s. Estimated total time: 19h 7m 14s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 12s.
[2025-11-13 11:27:42,377][__main__][INFO] - Starting iteration 524.
[2025-11-13 11:27:42,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:27:42,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:51,273][__main__][INFO] - Number of regex retries in iteration 524: 0
[2025-11-13 11:27:51,274][__main__][INFO] - agents played in iteration 524 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:27:51,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:51,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:51,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:51,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:51,831][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:51,831][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:52,539][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:52,840][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:53,822][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:54,150][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:54,476][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:55,459][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:55,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:56,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:56,442][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:57,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:57,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:57,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:58,727][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:59,055][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:59,382][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:59,707][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:00,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:00,686][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:01,012][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:01,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:01,668][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:01,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:02,649][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:02,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:03,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:04,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:04,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:04,404][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:05,288][__main__][INFO] - Iteration 525 took 22s (38.82% Gen, 57.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 44m 10s. Estimated total time: 19h 5m 25s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 54s.
[2025-11-13 11:28:05,290][__main__][INFO] - Starting iteration 525.
[2025-11-13 11:28:05,294][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:05,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:14,149][__main__][INFO] - Number of regex retries in iteration 525: 0
[2025-11-13 11:28:14,150][__main__][INFO] - agents played in iteration 525 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:28:14,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:14,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:14,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:14,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:14,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:14,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:15,449][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:16,073][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:16,725][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:17,057][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:17,385][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:18,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:18,704][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:19,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:19,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:20,014][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:20,342][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:20,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:21,003][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:21,329][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:21,657][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:21,983][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:22,309][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:22,637][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:23,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:23,947][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:24,275][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:24,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:24,926][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:25,252][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:25,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:26,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:27,303][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:27,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:27,306][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:28,196][__main__][INFO] - Iteration 526 took 22s (38.66% Gen, 57.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 43m 31s. Estimated total time: 19h 5m 10s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 51s.
[2025-11-13 11:28:28,198][__main__][INFO] - Starting iteration 526.
[2025-11-13 11:28:28,201][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:28,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:37,430][__main__][INFO] - Number of regex retries in iteration 526: 0
[2025-11-13 11:28:37,430][__main__][INFO] - agents played in iteration 526 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:28:37,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:37,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:37,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:37,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:37,986][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:37,987][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:38,731][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:39,689][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:40,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:40,345][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:41,660][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:41,986][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:42,311][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:43,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:43,961][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:44,289][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:44,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:45,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:45,595][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:45,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:46,247][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:46,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:47,224][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:47,874][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:48,524][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:48,852][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:49,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:49,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:50,583][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:50,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:50,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:51,456][__main__][INFO] - Iteration 527 took 23s (39.68% Gen, 56.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 44s. Estimated total time: 19h 22m 46s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 47s.
[2025-11-13 11:28:51,458][__main__][INFO] - Starting iteration 527.
[2025-11-13 11:28:51,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:51,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:00,491][__main__][INFO] - Number of regex retries in iteration 527: 0
[2025-11-13 11:29:00,492][__main__][INFO] - agents played in iteration 527 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:29:00,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:00,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:01,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:01,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:01,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:01,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:02,076][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:02,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:02,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:03,053][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:03,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:03,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:04,034][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:04,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:04,691][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:05,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:05,342][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:05,668][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:05,995][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:06,322][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:06,649][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:06,975][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:08,283][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:08,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:08,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:09,273][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:09,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:09,927][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:10,262][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:10,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:10,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:11,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:11,895][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:12,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:12,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:13,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:13,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:13,663][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:14,541][__main__][INFO] - Iteration 528 took 23s (39.12% Gen, 57.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 51m 38s. Estimated total time: 19h 14m 2s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 20s.
[2025-11-13 11:29:14,543][__main__][INFO] - Starting iteration 528.
[2025-11-13 11:29:14,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:14,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:23,611][__main__][INFO] - Number of regex retries in iteration 528: 0
[2025-11-13 11:29:23,611][__main__][INFO] - agents played in iteration 528 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:29:24,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:24,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:24,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:24,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:24,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:24,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:24,907][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:25,204][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:25,857][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:26,183][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:26,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:27,173][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:28,484][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:28,804][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:29,131][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:29,457][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:29,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:30,111][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:30,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:30,765][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:31,091][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:31,417][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:31,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:32,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:32,393][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:32,725][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:33,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:34,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:34,352][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:34,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:35,003][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:35,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:36,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:36,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:36,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:36,770][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:37,687][__main__][INFO] - Iteration 529 took 23s (39.17% Gen, 56.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 54m 15s. Estimated total time: 19h 17m 3s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 50s.
[2025-11-13 11:29:37,689][__main__][INFO] - Starting iteration 529.
[2025-11-13 11:29:37,693][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:37,693][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:46,789][__main__][INFO] - Number of regex retries in iteration 529: 0
[2025-11-13 11:29:46,791][__main__][INFO] - agents played in iteration 529 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:29:47,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:47,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:47,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:47,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:47,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:47,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:48,078][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:48,374][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:49,027][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:49,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:49,679][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:50,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:50,658][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:50,984][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:52,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:52,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:53,596][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:53,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:54,249][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:54,578][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:54,909][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:55,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:55,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:56,560][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:56,885][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:57,212][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:57,541][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:57,869][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:58,198][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:58,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:59,200][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:59,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:59,917][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:59,918][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:00,891][__main__][INFO] - Iteration 530 took 23s (39.22% Gen, 56.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 56m 46s. Estimated total time: 19h 19m 57s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 19s.
[2025-11-13 11:30:00,893][__main__][INFO] - Starting iteration 530.
[2025-11-13 11:30:00,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:00,896][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:09,389][__main__][INFO] - Number of regex retries in iteration 530: 0
[2025-11-13 11:30:09,390][__main__][INFO] - agents played in iteration 530 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:30:09,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:09,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:09,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:09,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:09,955][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:09,955][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:10,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:11,005][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:11,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:11,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:12,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:12,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:13,289][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:13,941][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:14,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:14,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:14,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:15,249][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:15,902][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:16,227][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:16,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:16,879][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:17,206][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:17,531][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:18,184][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:18,509][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:18,834][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:19,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:19,485][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:19,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:20,479][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:21,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:21,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:22,528][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:22,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:22,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:24,311][__main__][INFO] - Iteration 531 took 23s (36.27% Gen, 56.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 7m 15s. Estimated total time: 19h 30m 49s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 8s.
[2025-11-13 11:30:24,313][__main__][INFO] - Starting iteration 531.
[2025-11-13 11:30:24,317][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:30:24,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:33,870][__main__][INFO] - Number of regex retries in iteration 531: 0
[2025-11-13 11:30:33,871][__main__][INFO] - agents played in iteration 531 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:30:34,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:34,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:34,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:34,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:34,447][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:34,447][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:35,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:35,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:35,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:36,444][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:36,775][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:37,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:38,075][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:38,405][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:39,059][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:39,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:39,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:40,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:40,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:41,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:41,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:41,996][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:42,651][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:42,980][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:43,308][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:43,969][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:44,620][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:44,945][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:45,276][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:45,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:46,269][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:46,974][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:46,975][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:46,977][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:47,844][__main__][INFO] - Iteration 532 took 23s (40.61% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 12m 26s. Estimated total time: 19h 36m 24s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 4s.
[2025-11-13 11:30:47,846][__main__][INFO] - Starting iteration 532.
[2025-11-13 11:30:47,849][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:30:47,850][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:56,230][__main__][INFO] - Number of regex retries in iteration 532: 0
[2025-11-13 11:30:56,231][__main__][INFO] - agents played in iteration 532 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:30:56,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:56,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:56,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:56,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:56,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:56,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:57,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:58,156][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:58,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:58,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:00,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:00,777][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:01,104][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:01,432][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:02,088][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:02,413][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:03,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:03,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:04,044][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:04,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:05,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:06,001][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:06,326][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:06,652][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:06,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:07,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:07,634][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:07,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:08,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:31:09,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:31:09,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:31:09,358][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:31:10,307][__main__][INFO] - Iteration 533 took 22s (37.32% Gen, 58.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 18m 37s. Estimated total time: 18h 42m 57s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 25s, 500 more iterations: 3h 7m 9s.
[2025-11-13 11:31:10,309][__main__][INFO] - Starting iteration 533.
[2025-11-13 11:31:10,313][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
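The "Estimated remaining time" and "Estimated total time" figures in the iteration summaries above are consistent with simple linear extrapolation from the observed per-iteration wall time. A minimal sketch of that arithmetic, with hypothetical helper names (this is not the actual mllm code):

```python
# Hypothetical reconstruction of the log's time estimates: average seconds
# per iteration multiplied by the number of iterations still to run.

def format_eta(seconds: float) -> str:
    """Render a duration as 'Hh Mm Ss', in the style of the log lines."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if m or h:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

def estimate_remaining(iter_seconds: float, iters_done: int, iters_total: int) -> str:
    """Linear extrapolation from the observed per-iteration wall time."""
    return format_eta(iter_seconds * (iters_total - iters_done))

print(estimate_remaining(22.0, 0, 100))  # 100 iterations at 22 s each -> 36m 40s
```

At roughly 22 s per iteration, small drifts in generation or training time shift the multi-hour totals by minutes, which matches the iteration-to-iteration wobble in the logged estimates.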
[2025-11-13 11:31:10,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:31:18,649][__main__][INFO] - Number of regex retries in iteration 533: 0
[2025-11-13 11:31:18,650][__main__][INFO] - agents played in iteration 533 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:31:19,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:19,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:19,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:19,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:19,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:19,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:31:20,265][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:31:20,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:31:20,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:31:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:31:21,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:31:21,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:31:22,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:22,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:22,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:23,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:23,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:23,851][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:24,181][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:24,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:24,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:25,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:25,488][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:25,814][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:26,469][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:26,798][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:27,128][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:27,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:27,782][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:28,110][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:28,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:29,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:30,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:31,149][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:31:31,854][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:31:31,855][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:31:31,857][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:31:32,753][__main__][INFO] - Iteration 534 took 22s (37.15% Gen, 58.85% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 20s. Estimated total time: 18h 42m 3s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 24s, 500 more iterations: 3h 7m 0s.
[2025-11-13 11:31:32,755][__main__][INFO] - Starting iteration 534.
[2025-11-13 11:31:32,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:31:32,759][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:31:41,452][__main__][INFO] - Number of regex retries in iteration 534: 0
[2025-11-13 11:31:41,453][__main__][INFO] - agents played in iteration 534 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:31:41,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:41,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:41,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:42,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:42,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:42,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:42,757][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:31:43,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:31:43,381][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:31:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:31:44,034][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:31:44,361][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:31:44,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:31:45,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:45,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:46,325][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:46,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:46,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:47,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:47,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:47,954][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:48,280][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:48,612][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:48,937][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:49,266][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:49,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:49,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:50,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:50,579][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:50,903][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:51,228][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:51,554][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:51,885][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:52,224][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:52,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:52,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:53,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:53,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:31:54,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:31:54,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:31:54,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:31:55,496][__main__][INFO] - Iteration 535 took 22s (38.23% Gen, 57.82% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 31m 49s. Estimated total time: 18h 56m 55s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 29s.
[2025-11-13 11:31:55,498][__main__][INFO] - Starting iteration 535.
[2025-11-13 11:31:55,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:31:55,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:32:03,989][__main__][INFO] - Number of regex retries in iteration 535: 0
[2025-11-13 11:32:03,989][__main__][INFO] - agents played in iteration 535 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:32:04,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:04,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:04,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:04,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:04,543][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:32:04,543][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:32:05,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:32:05,581][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:32:05,907][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:32:06,234][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:32:06,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:32:06,886][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:32:07,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:32:07,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:32:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:32:08,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:32:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:32:08,843][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:32:09,169][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:32:09,496][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:32:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:32:10,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:32:10,476][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:32:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:32:11,129][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:32:11,459][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:32:11,785][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:32:12,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:32:12,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:32:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:32:13,095][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:32:13,424][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:32:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:32:14,079][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:32:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:32:14,739][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:32:15,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:32:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:32:15,723][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:32:16,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:32:17,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:32:17,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:32:17,164][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:32:18,129][__main__][INFO] - Iteration 536 took 22s (37.51% Gen, 58.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 54s. Estimated total time: 18h 51m 22s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 42s, 500 more iterations: 3h 8m 33s.
[2025-11-13 11:32:18,131][__main__][INFO] - Starting iteration 536.
[2025-11-13 11:32:18,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:32:18,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:32:26,917][__main__][INFO] - Number of regex retries in iteration 536: 0
[2025-11-13 11:32:26,918][__main__][INFO] - agents played in iteration 536 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:32:27,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:27,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:27,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:27,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:27,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:32:27,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:32:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:32:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:32:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:32:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:32:29,516][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:32:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:32:30,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:32:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:32:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:32:31,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:32:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:32:31,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:32:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:32:32,453][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:32:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:32:33,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:32:33,432][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:32:33,758][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:32:34,083][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:32:34,413][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:32:34,740][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:32:35,067][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:32:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:32:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:32:36,046][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:32:36,372][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:32:36,705][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:32:37,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:32:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:32:37,687][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:32:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:32:38,339][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:32:38,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:32:39,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:32:40,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:32:40,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:32:40,050][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:32:41,008][__main__][INFO] - Iteration 537 took 22s (38.39% Gen, 57.41% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 37m 52s. Estimated total time: 19h 3m 44s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 37s.
[2025-11-13 11:32:41,011][__main__][INFO] - Starting iteration 537.
[2025-11-13 11:32:41,014][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:32:41,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:32:50,036][__main__][INFO] - Number of regex retries in iteration 537: 0
[2025-11-13 11:32:50,037][__main__][INFO] - agents played in iteration 537 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:32:50,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:50,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:50,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:50,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:32:50,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:32:50,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:32:51,328][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:32:51,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:32:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:32:52,278][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:32:52,605][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:32:52,931][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:32:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:32:53,583][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:32:53,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:32:54,234][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:32:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:32:54,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:32:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:32:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:32:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:32:56,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:32:56,523][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:32:56,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:32:57,169][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:32:57,495][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:32:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:32:58,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:32:58,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:32:58,802][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:32:59,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:32:59,454][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:32:59,779][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:33:00,108][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:33:00,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:33:00,761][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:33:01,093][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:33:01,422][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:33:01,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:33:02,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:33:03,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:33:03,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:33:03,224][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:33:04,141][__main__][INFO] - Iteration 538 took 23s (39.01% Gen, 57.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 50m 10s. Estimated total time: 19h 16m 25s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 44s.
[2025-11-13 11:33:04,143][__main__][INFO] - Starting iteration 538.
[2025-11-13 11:33:04,146][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:33:04,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:12,642][__main__][INFO] - Number of regex retries in iteration 538: 0 [2025-11-13 11:33:12,643][__main__][INFO] - agents played in iteration 538 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:33:13,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:13,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:13,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:13,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:13,200][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:13,200][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:13,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:14,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:15,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:15,868][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:16,520][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:16,847][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:18,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:18,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:18,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:19,459][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:19,785][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:20,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:33:20,763][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:21,090][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:21,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:21,746][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:22,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:22,729][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:23,055][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:23,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:24,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:33:25,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:25,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:25,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:25,831][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:26,738][__main__][INFO] - Iteration 539 took 22s (37.60% Gen, 58.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 22m 59s. Estimated total time: 18h 49m 36s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 39s, 500 more iterations: 3h 8m 16s. [2025-11-13 11:33:26,740][__main__][INFO] - Starting iteration 539. [2025-11-13 11:33:26,743][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:33:26,744][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:35,122][__main__][INFO] - Number of regex retries in iteration 539: 0 [2025-11-13 11:33:35,122][__main__][INFO] - agents played in iteration 539 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:33:35,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:35,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:35,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:35,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:35,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:35,659][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:36,635][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:37,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:37,947][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:38,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:38,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:39,278][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:39,936][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:40,257][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:40,584][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:41,241][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:41,564][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:41,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:42,546][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:42,871][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:33:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:43,524][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:44,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:44,507][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:44,833][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:45,160][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:45,486][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:46,143][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:46,470][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:46,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:33:47,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:48,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:48,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:48,223][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:49,124][__main__][INFO] - Iteration 540 took 22s (37.43% Gen, 58.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 12m 5s. Estimated total time: 18h 39m 5s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 18s, 500 more iterations: 3h 6m 30s. [2025-11-13 11:33:49,126][__main__][INFO] - Starting iteration 540. [2025-11-13 11:33:49,129][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:33:49,130][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:57,830][__main__][INFO] - Number of regex retries in iteration 540: 0 [2025-11-13 11:33:57,831][__main__][INFO] - agents played in iteration 540 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:33:58,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:58,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:58,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:58,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:58,385][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:58,385][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:34:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:34:00,359][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:34:00,685][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:34:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:34:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:34:01,671][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:34:02,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:34:02,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:34:02,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:34:02,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:34:03,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:34:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:34:03,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:34:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:34:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:34:04,956][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:34:05,282][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:34:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:34:05,932][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:34:06,259][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:34:06,585][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:34:06,909][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:34:07,237][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:34:07,563][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:34:07,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:34:08,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:34:08,543][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:34:08,869][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:34:09,195][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:34:09,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:10,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:34:11,041][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:34:11,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:34:11,044][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:34:12,784][__main__][INFO] - Iteration 541 took 23s (36.78% Gen, 55.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 15m 24s. Estimated total time: 19h 42m 47s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 7s. [2025-11-13 11:34:12,786][__main__][INFO] - Starting iteration 541. [2025-11-13 11:34:12,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 11:34:12,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:34:21,411][__main__][INFO] - Number of regex retries in iteration 541: 0 [2025-11-13 11:34:21,412][__main__][INFO] - agents played in iteration 541 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:34:21,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:21,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:21,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:21,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:21,969][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:34:21,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:34:22,651][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:34:22,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:34:23,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:34:23,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:34:23,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:34:24,259][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:34:24,589][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:34:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:34:25,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:34:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:34:25,921][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:34:26,254][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:34:26,583][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:34:26,912][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:34:27,239][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:34:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:34:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:34:28,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:34:28,546][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:34:28,873][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:34:29,199][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:34:29,531][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:34:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:34:30,177][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:34:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:34:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:34:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:34:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:34:31,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:34:32,135][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:34:32,469][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:34:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:34:33,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:33,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:34:34,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:34:34,546][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:34:34,548][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:34:35,439][__main__][INFO] - Iteration 542 took 22s (38.06% Gen, 57.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 24m 45s. Estimated total time: 18h 52m 31s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 45s. [2025-11-13 11:34:35,441][__main__][INFO] - Starting iteration 542. [2025-11-13 11:34:35,444][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 11:34:35,445][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:34:43,641][__main__][INFO] - Number of regex retries in iteration 542: 0 [2025-11-13 11:34:43,642][__main__][INFO] - agents played in iteration 542 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:34:44,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:44,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:44,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:44,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:34:44,216][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:34:44,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:34:44,895][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:34:45,190][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:34:45,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:34:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:34:46,170][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:34:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:34:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:34:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:34:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:34:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:34:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:34:48,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:34:48,782][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:34:49,109][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:34:49,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:34:49,760][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:34:50,090][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:34:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:34:50,746][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:34:51,072][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:34:51,399][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:34:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:34:52,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:34:52,378][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:34:52,704][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:34:53,030][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:34:53,359][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:34:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:34:54,014][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:34:54,341][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:34:54,668][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:34:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:34:55,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:56,055][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:34:56,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:34:56,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:34:56,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:34:57,577][__main__][INFO] - Iteration 543 took 22s (37.03% Gen, 59.27% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 58m 33s. Estimated total time: 18h 26m 41s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 53s, 500 more iterations: 3h 4m 26s. [2025-11-13 11:34:57,579][__main__][INFO] - Starting iteration 543. [2025-11-13 11:34:57,582][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 11:34:57,582][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:35:06,459][__main__][INFO] - Number of regex retries in iteration 543: 0 [2025-11-13 11:35:06,459][__main__][INFO] - agents played in iteration 543 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:35:06,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:06,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:06,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:06,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:35:06,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:35:06,995][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:35:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:35:07,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:35:08,292][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:35:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:35:08,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:35:09,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:35:09,591][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:35:09,918][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:35:10,245][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:35:10,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:35:10,894][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:35:11,219][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:35:11,545][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:35:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:35:12,206][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:35:12,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:35:12,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:35:13,185][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:35:13,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:35:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:35:14,183][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:35:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:35:14,843][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:35:15,169][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:35:15,497][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:35:15,825][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:35:16,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:35:16,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:35:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:35:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:35:17,459][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:35:17,785][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:35:18,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:18,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:19,518][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:19,519][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:19,521][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:20,373][__main__][INFO] - Iteration 544 took 22s (38.94% Gen, 57.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 31m 7s. Estimated total time: 18h 59m 38s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 59s, 500 more iterations: 3h 9m 56s.
[2025-11-13 11:35:20,375][__main__][INFO] - Starting iteration 544.
[2025-11-13 11:35:20,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:20,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:28,872][__main__][INFO] - Number of regex retries in iteration 544: 0
[2025-11-13 11:35:28,873][__main__][INFO] - agents played in iteration 544 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:35:29,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:29,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:29,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:29,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:29,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:29,423][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:30,101][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:31,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:31,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:32,351][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:32,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:33,338][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:33,663][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:33,987][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:34,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:35,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:35,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:36,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:36,600][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:36,931][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:37,256][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:37,585][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:37,912][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:38,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:35:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:35:38,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:35:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:35:39,558][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:35:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:35:40,215][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:35:40,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:41,283][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:41,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:41,964][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:41,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:42,772][__main__][INFO] - Iteration 545 took 22s (37.93% Gen, 58.46% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 10m 53s. Estimated total time: 18h 39m 46s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 19s, 500 more iterations: 3h 6m 37s.
[2025-11-13 11:35:42,774][__main__][INFO] - Starting iteration 545.
[2025-11-13 11:35:42,777][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:42,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:52,618][__main__][INFO] - Number of regex retries in iteration 545: 0
[2025-11-13 11:35:52,619][__main__][INFO] - agents played in iteration 545 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:35:53,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:53,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:53,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:53,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:53,194][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:53,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:53,867][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:54,820][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:55,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:55,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:56,472][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:56,799][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:57,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:57,778][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:58,102][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:58,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:58,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:59,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:59,409][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:59,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:00,062][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:01,039][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:01,365][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:01,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:02,017][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:02,342][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:02,995][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:03,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:04,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:05,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:05,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:05,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:05,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:06,523][__main__][INFO] - Iteration 546 took 23s (41.44% Gen, 55.10% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 3s. Estimated total time: 19h 47m 20s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 53s.
[2025-11-13 11:36:06,525][__main__][INFO] - Starting iteration 546.
[2025-11-13 11:36:06,528][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:06,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:16,850][__main__][INFO] - Number of regex retries in iteration 546: 0
[2025-11-13 11:36:16,850][__main__][INFO] - agents played in iteration 546 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:36:17,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:17,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:17,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:17,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:17,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:17,404][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:18,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:18,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:18,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:19,041][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:19,369][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:20,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:20,365][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:20,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:21,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:21,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:22,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:22,329][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:22,655][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:22,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:23,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:23,630][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:23,958][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:24,285][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:24,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:25,586][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:26,236][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:26,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:26,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:27,214][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:27,542][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:28,204][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:28,543][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:29,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:29,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:29,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:29,948][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:30,771][__main__][INFO] - Iteration 547 took 24s (42.57% Gen, 54.03% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 42m 31s. Estimated total time: 20h 12m 12s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 24s, 500 more iterations: 3h 22m 2s.
[2025-11-13 11:36:30,774][__main__][INFO] - Starting iteration 547.
[2025-11-13 11:36:30,778][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:30,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:40,849][__main__][INFO] - Number of regex retries in iteration 547: 0
[2025-11-13 11:36:40,849][__main__][INFO] - agents played in iteration 547 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:36:41,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,400][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:41,400][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:42,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:42,413][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:43,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:43,393][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:43,719][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:44,044][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:44,371][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:44,699][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:45,354][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:46,016][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:46,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:47,326][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:47,651][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:48,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:48,631][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:48,956][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:49,281][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:49,607][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:49,932][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:50,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:50,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:50,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:51,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:51,559][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:52,212][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:52,540][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:53,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:53,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:53,993][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:53,995][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:54,827][__main__][INFO] - Iteration 548 took 24s (41.87% Gen, 54.66% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 32m 27s. Estimated total time: 20h 2m 32s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 25s.
[2025-11-13 11:36:54,829][__main__][INFO] - Starting iteration 548.
[2025-11-13 11:36:54,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:54,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:05,717][__main__][INFO] - Number of regex retries in iteration 548: 0
[2025-11-13 11:37:05,718][__main__][INFO] - agents played in iteration 548 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:37:06,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:06,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:06,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:06,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:06,276][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:06,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:08,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:08,917][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:09,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:10,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:10,546][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:11,521][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:11,848][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:12,173][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:12,498][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:13,151][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:13,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:13,803][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:14,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:15,110][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:15,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:16,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:16,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:17,413][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:18,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:18,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:18,875][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:18,877][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:19,751][__main__][INFO] - Iteration 549 took 24s (43.68% Gen, 52.81% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 15m 29s. Estimated total time: 20h 45m 59s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 31s, 500 more iterations: 3h 27m 39s.
[2025-11-13 11:37:19,753][__main__][INFO] - Starting iteration 549.
[2025-11-13 11:37:19,756][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:19,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:30,265][__main__][INFO] - Number of regex retries in iteration 549: 0
[2025-11-13 11:37:30,266][__main__][INFO] - agents played in iteration 549 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:37:30,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:30,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:30,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:30,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:30,817][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:30,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:31,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:31,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:32,797][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:33,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:33,770][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:34,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:34,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:34,745][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:35,073][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:35,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:35,729][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:36,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:36,382][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:36,708][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:37,035][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:37,360][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:37,686][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:38,010][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:38,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:38,660][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:38,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:39,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:40,298][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:40,628][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:41,282][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:41,608][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:41,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:42,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:43,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:43,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:43,357][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:44,185][__main__][INFO] - Iteration 550 took 24s (43.01% Gen, 53.59% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 50m 35s. Estimated total time: 20h 21m 29s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 42s, 500 more iterations: 3h 23m 34s.
[2025-11-13 11:37:44,187][__main__][INFO] - Starting iteration 550.
[2025-11-13 11:37:44,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:44,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:54,160][__main__][INFO] - Number of regex retries in iteration 550: 0
[2025-11-13 11:37:54,160][__main__][INFO] - agents played in iteration 550 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:37:54,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:54,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:54,733][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:55,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:55,732][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:56,386][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:57,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:57,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:57,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:58,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:58,666][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:59,642][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:59,972][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:38:00,292][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:38:00,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:38:00,943][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:38:01,267][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:01,919][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:02,244][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:02,569][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:02,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:03,219][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:03,544][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:04,195][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:04,520][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:04,847][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:05,172][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:05,498][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:05,826][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:06,535][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:07,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:07,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:07,256][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:08,816][__main__][INFO] - Iteration 551 took 24s (40.48% Gen, 53.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 0m 3s. Estimated total time: 20h 31m 22s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 2s, 500 more iterations: 3h 25m 13s.
[2025-11-13 11:38:08,819][__main__][INFO] - Starting iteration 551.
[2025-11-13 11:38:08,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:38:08,822][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:38:18,840][__main__][INFO] - Number of regex retries in iteration 551: 0
[2025-11-13 11:38:18,840][__main__][INFO] - agents played in iteration 551 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:38:19,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:19,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:19,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:19,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:19,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:38:19,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:38:20,125][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:38:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:38:20,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:38:21,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:38:21,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:38:21,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:38:22,051][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:38:22,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:38:22,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:38:23,026][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:38:23,351][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:38:23,679][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:38:24,004][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:38:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:38:24,655][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:38:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:38:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:38:25,630][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:38:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:26,281][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:26,606][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:27,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:27,586][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:27,912][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:28,237][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:28,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:28,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:29,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:30,194][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:30,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:31,237][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:31,942][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:31,943][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:31,945][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:32,801][__main__][INFO] - Iteration 552 took 23s (41.78% Gen, 54.65% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 27m 19s. Estimated total time: 19h 59m 2s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 50s.
[2025-11-13 11:38:32,803][__main__][INFO] - Starting iteration 552.
[2025-11-13 11:38:32,806][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:38:32,806][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:38:43,264][__main__][INFO] - Number of regex retries in iteration 552: 0
[2025-11-13 11:38:43,265][__main__][INFO] - agents played in iteration 552 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:38:43,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:43,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:43,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:43,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:43,821][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:38:43,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:38:44,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:38:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:38:45,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:38:45,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:38:45,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:38:46,134][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:38:46,462][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:38:46,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:38:47,113][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:38:47,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:38:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:38:48,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:38:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:38:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:38:49,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:38:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:38:49,755][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:38:50,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:38:50,416][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:50,743][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:51,721][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:52,378][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:52,703][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:53,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:53,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:54,010][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:54,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:54,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:55,719][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:56,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:56,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:56,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:57,325][__main__][INFO] - Iteration 553 took 24s (42.65% Gen, 53.69% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 53m 52s. Estimated total time: 20h 25m 59s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 51s, 500 more iterations: 3h 24m 19s.
[2025-11-13 11:38:57,327][__main__][INFO] - Starting iteration 553.
[2025-11-13 11:38:57,330][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:38:57,331][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:39:07,600][__main__][INFO] - Number of regex retries in iteration 553: 0
[2025-11-13 11:39:07,600][__main__][INFO] - agents played in iteration 553 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:39:08,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:08,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:08,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:08,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:08,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:39:08,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:39:08,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:39:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:39:09,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:39:09,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:39:10,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:39:10,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:39:10,850][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:39:11,178][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:39:11,511][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:39:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:39:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:39:12,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:39:12,821][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:39:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:39:13,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:39:13,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:39:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:39:14,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:39:14,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:39:15,113][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:39:15,440][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:39:15,765][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:39:16,092][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:39:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:39:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:39:17,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:39:17,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:39:17,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:39:18,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:39:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:39:18,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:39:19,029][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:39:19,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:20,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:39:20,783][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:39:20,785][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:39:20,788][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:39:21,600][__main__][INFO] - Iteration 554 took 24s (42.31% Gen, 54.33% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 41m 0s. Estimated total time: 20h 13m 32s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 27s, 500 more iterations: 3h 22m 15s.
[2025-11-13 11:39:21,602][__main__][INFO] - Starting iteration 554.
[2025-11-13 11:39:21,605][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:39:21,606][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:39:32,425][__main__][INFO] - Number of regex retries in iteration 554: 0
[2025-11-13 11:39:32,426][__main__][INFO] - agents played in iteration 554 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:39:32,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:32,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:32,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:32,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:32,974][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:39:32,974][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:39:33,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:39:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:39:34,321][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:39:34,655][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:39:34,982][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:39:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:39:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:39:35,977][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:39:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:39:36,634][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:39:36,959][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:39:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:39:37,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:39:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:39:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:39:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:39:38,917][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:39:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:39:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:39:39,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:39:40,226][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:39:40,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:39:40,882][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:39:41,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:39:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:39:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:39:42,183][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:39:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:39:42,833][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:39:43,164][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:39:43,497][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:39:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:39:44,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:44,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:39:45,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:39:45,603][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:39:45,605][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:39:46,447][__main__][INFO] - Iteration 555 took 24s (43.55% Gen, 53.05% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 9m 12s. Estimated total time: 20h 42m 9s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 24s, 500 more iterations: 3h 27m 1s.
[2025-11-13 11:39:46,449][__main__][INFO] - Starting iteration 555.
[2025-11-13 11:39:46,452][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:39:46,453][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:39:56,865][__main__][INFO] - Number of regex retries in iteration 555: 0
[2025-11-13 11:39:56,865][__main__][INFO] - agents played in iteration 555 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:39:57,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:57,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:57,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:57,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:57,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:39:57,410][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:39:58,140][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:39:58,437][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:39:58,763][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:39:59,088][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:39:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:39:59,741][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:40:00,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:40:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:40:00,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:40:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:01,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:02,018][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:02,348][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:02,676][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:03,002][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:03,660][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:03,985][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:04,636][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:04,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:05,624][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:06,278][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:07,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:07,587][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:07,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:08,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:09,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:10,035][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:10,037][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:10,039][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:10,856][__main__][INFO] - Iteration 556 took 24s (42.67% Gen, 53.98% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 46m 52s. Estimated total time: 20h 20m 13s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 40s, 500 more iterations: 3h 23m 22s.
[2025-11-13 11:40:10,858][__main__][INFO] - Starting iteration 556.
[2025-11-13 11:40:10,860][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:10,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:40:21,414][__main__][INFO] - Number of regex retries in iteration 556: 0
[2025-11-13 11:40:21,414][__main__][INFO] - agents played in iteration 556 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:40:21,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:21,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:21,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:21,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:21,973][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:40:21,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:40:22,700][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:40:23,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:40:23,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:40:23,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:40:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:40:24,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:40:24,624][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:40:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:40:25,275][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:40:25,605][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:25,925][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:26,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:27,883][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:28,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:28,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:28,859][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:29,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:29,520][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:29,845][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:30,496][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:31,148][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:32,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:33,108][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:33,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:34,562][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:34,563][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:34,565][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:35,438][__main__][INFO] - Iteration 557 took 24s (42.93% Gen, 53.50% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 55m 11s. Estimated total time: 20h 28m 57s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 57s, 500 more iterations: 3h 24m 49s.
[2025-11-13 11:40:35,440][__main__][INFO] - Starting iteration 557.
[2025-11-13 11:40:35,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:35,444][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:40:45,235][__main__][INFO] - Number of regex retries in iteration 557: 0
[2025-11-13 11:40:45,236][__main__][INFO] - agents played in iteration 557 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:40:45,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:45,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:45,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:45,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:45,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:40:45,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:40:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:40:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:40:47,189][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:40:47,517][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:40:47,844][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:40:48,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:40:48,506][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:40:48,834][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:40:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:40:49,491][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:49,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:50,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:51,123][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:51,449][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:51,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:52,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:52,440][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:53,106][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:53,425][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:53,755][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:54,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:54,416][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:55,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:56,727][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:57,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:57,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:58,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:58,537][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:58,538][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:59,360][__main__][INFO] - Iteration 558 took 23s (40.94% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 21m 45s. Estimated total time: 19h 55m 54s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 19s.
[2025-11-13 11:40:59,363][__main__][INFO] - Starting iteration 558.
[2025-11-13 11:40:59,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:59,366][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:09,824][__main__][INFO] - Number of regex retries in iteration 558: 0
[2025-11-13 11:41:09,825][__main__][INFO] - agents played in iteration 558 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:41:10,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:10,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:10,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:10,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:10,395][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:10,395][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:11,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:11,722][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:12,055][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:12,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:12,714][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:13,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:13,706][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:14,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:14,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:15,029][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:15,361][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:15,692][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:16,346][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:16,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:17,337][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:17,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:17,990][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:18,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:18,976][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:19,301][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:19,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:19,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:20,286][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:20,612][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:20,941][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:21,269][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:21,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:22,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:23,014][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:23,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:23,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:23,828][__main__][INFO] - Iteration 559 took 24s (42.75% Gen, 53.93% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 48m 38s. Estimated total time: 20h 23m 12s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 46s, 500 more iterations: 3h 23m 52s.
[2025-11-13 11:41:23,831][__main__][INFO] - Starting iteration 559.
[2025-11-13 11:41:23,834][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:23,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:34,276][__main__][INFO] - Number of regex retries in iteration 559: 0
[2025-11-13 11:41:34,277][__main__][INFO] - agents played in iteration 559 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:41:34,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:34,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:34,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:34,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:34,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:34,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:35,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:35,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:36,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:36,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:38,182][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:38,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:39,489][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:39,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:40,139][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:40,465][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:40,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:41,448][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:42,109][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:42,429][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:42,754][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:43,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:43,407][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:45,052][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:45,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:46,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:46,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:47,458][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:47,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:47,461][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:48,294][__main__][INFO] - Iteration 560 took 24s (42.69% Gen, 53.90% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 48m 6s. Estimated total time: 20h 23m 4s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 46s, 500 more iterations: 3h 23m 50s.
[2025-11-13 11:41:48,296][__main__][INFO] - Starting iteration 560.
[2025-11-13 11:41:48,299][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:48,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:41:58,219][__main__][INFO] - Number of regex retries in iteration 560: 0 [2025-11-13 11:41:58,219][__main__][INFO] - agents played in iteration 560 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:41:58,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:58,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:58,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:58,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:41:58,765][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:41:58,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:41:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:41:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:42:00,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:42:00,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:42:00,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:42:01,072][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:42:01,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:42:01,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:42:02,049][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:42:02,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:42:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:42:03,027][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:42:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:42:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:42:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:42:04,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:42:04,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:42:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:42:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:42:05,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:42:05,956][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:42:06,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:42:06,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:42:06,932][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:42:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:42:07,585][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:42:07,910][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:42:08,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:42:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:42:08,885][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:42:09,211][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:42:09,538][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:42:09,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:42:10,564][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:42:11,271][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:42:11,272][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:42:11,274][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:42:12,875][__main__][INFO] - Iteration 561 took 24s (40.36% Gen, 53.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 53m 26s. Estimated total time: 20h 28m 49s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 57s, 500 more iterations: 3h 24m 48s. [2025-11-13 11:42:12,877][__main__][INFO] - Starting iteration 561. [2025-11-13 11:42:12,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:42:12,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:42:22,648][__main__][INFO] - Number of regex retries in iteration 561: 0 [2025-11-13 11:42:22,648][__main__][INFO] - agents played in iteration 561 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:42:23,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:23,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:23,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:23,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:23,203][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:42:23,203][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:42:23,912][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:42:24,208][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:42:24,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:42:24,870][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:42:25,198][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:42:25,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:42:25,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:42:26,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:42:26,520][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:42:26,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:42:27,185][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:42:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:42:27,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:42:28,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:42:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:42:28,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:42:29,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:42:29,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:42:29,821][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:42:30,146][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:42:30,471][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:42:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:42:31,123][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:42:31,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:42:31,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:42:32,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:42:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:42:32,758][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:42:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:42:33,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:42:33,746][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:42:34,075][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:42:34,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:42:35,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:42:35,798][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:42:35,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:42:35,801][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:42:36,654][__main__][INFO] - Iteration 562 took 23s (41.08% Gen, 55.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 12m 59s. Estimated total time: 19h 48m 46s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 7s. [2025-11-13 11:42:36,656][__main__][INFO] - Starting iteration 562. [2025-11-13 11:42:36,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:42:36,660][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:42:46,334][__main__][INFO] - Number of regex retries in iteration 562: 0 [2025-11-13 11:42:46,334][__main__][INFO] - agents played in iteration 562 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:42:46,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:46,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:46,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:46,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:46,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:42:46,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:42:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:42:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:42:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:42:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:42:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:42:49,209][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:42:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:42:49,861][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:42:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:42:50,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:42:50,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:42:51,165][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:42:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:42:51,817][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:42:52,143][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:42:52,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:42:52,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:42:53,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:42:53,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:42:53,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:42:54,101][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:42:54,426][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:42:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:42:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:42:55,404][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:42:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:42:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:42:56,382][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:42:56,708][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:42:57,034][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:42:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:42:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:42:58,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:42:58,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:42:59,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:42:59,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:42:59,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:00,327][__main__][INFO] - Iteration 563 took 23s (40.87% Gen, 55.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 16s. Estimated total time: 19h 43m 27s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 14s. [2025-11-13 11:43:00,329][__main__][INFO] - Starting iteration 563. [2025-11-13 11:43:00,332][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:43:00,333][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:09,409][__main__][INFO] - Number of regex retries in iteration 563: 0 [2025-11-13 11:43:09,409][__main__][INFO] - agents played in iteration 563 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:43:09,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:09,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:09,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:09,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:09,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:09,953][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:43:10,662][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:43:10,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:43:11,287][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:43:11,610][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:43:11,938][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:43:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:43:12,590][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:43:12,917][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:43:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:43:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:43:13,899][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:43:14,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:43:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:43:14,885][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:43:15,210][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:43:15,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:43:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:43:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:43:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:43:16,846][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:43:17,172][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:43:17,496][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:43:17,825][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:43:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:43:18,479][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:43:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:43:19,129][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:43:19,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:43:19,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:43:20,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:43:20,437][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:43:20,763][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:43:21,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:43:21,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:43:22,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:43:22,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:43:22,511][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:23,390][__main__][INFO] - Iteration 564 took 23s (39.36% Gen, 56.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 36m 24s. Estimated total time: 19h 12m 57s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 9s. [2025-11-13 11:43:23,393][__main__][INFO] - Starting iteration 564. [2025-11-13 11:43:23,396][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:43:23,397][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:33,208][__main__][INFO] - Number of regex retries in iteration 564: 0 [2025-11-13 11:43:33,209][__main__][INFO] - agents played in iteration 564 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:43:33,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:33,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:33,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:33,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:33,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:33,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:43:34,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:43:34,784][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:43:35,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:43:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:43:35,762][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:43:36,088][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:43:36,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:43:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:43:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:43:37,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:43:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:43:38,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:43:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:43:38,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:43:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:43:39,375][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:43:39,700][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:43:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:43:40,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:43:40,683][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:43:41,008][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:43:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:43:41,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:43:41,996][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:43:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:43:42,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:43:42,971][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:43:43,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:43:43,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:43:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:43:44,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:43:44,608][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:43:44,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:43:45,639][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:43:46,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:43:46,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:43:46,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:47,209][__main__][INFO] - Iteration 565 took 23s (41.20% Gen, 55.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 13m 44s. Estimated total time: 19h 50m 41s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 26s. [2025-11-13 11:43:47,212][__main__][INFO] - Starting iteration 565. [2025-11-13 11:43:47,216][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:43:47,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:57,071][__main__][INFO] - Number of regex retries in iteration 565: 0 [2025-11-13 11:43:57,071][__main__][INFO] - agents played in iteration 565 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:43:57,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:57,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:57,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:57,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:57,632][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:57,632][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:43:58,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:43:58,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:43:58,972][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:43:59,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:43:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:43:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:00,605][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:00,934][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:01,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:02,240][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:02,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:03,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:03,882][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:04,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:04,542][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:04,861][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:05,186][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:05,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:05,843][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:06,163][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:06,488][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:06,813][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:07,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:07,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:08,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:08,440][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:08,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:09,472][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:10,212][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:10,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:10,215][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:11,140][__main__][INFO] - Iteration 566 took 23s (41.19% Gen, 54.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 54s. Estimated total time: 19h 56m 15s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 22s.
[2025-11-13 11:44:11,142][__main__][INFO] - Starting iteration 566.
[2025-11-13 11:44:11,145][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:11,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:20,110][__main__][INFO] - Number of regex retries in iteration 566: 0
[2025-11-13 11:44:20,111][__main__][INFO] - agents played in iteration 566 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:44:20,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:20,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:20,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:20,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:20,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:20,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:21,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:21,685][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:22,343][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:22,668][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:22,993][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:23,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:24,312][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:24,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:24,972][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:25,299][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:25,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:25,956][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:26,282][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:26,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:26,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:27,908][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:28,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:28,559][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:29,210][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:29,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:29,860][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:30,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:30,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:31,486][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:31,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:32,516][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:33,237][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:33,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:33,240][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:34,156][__main__][INFO] - Iteration 567 took 23s (38.96% Gen, 57.05% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 32m 51s. Estimated total time: 19h 10m 35s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 45s.
[2025-11-13 11:44:34,158][__main__][INFO] - Starting iteration 567.
[2025-11-13 11:44:34,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:34,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:43,519][__main__][INFO] - Number of regex retries in iteration 567: 0
[2025-11-13 11:44:43,520][__main__][INFO] - agents played in iteration 567 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:44:43,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:43,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:44,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:44,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:44,078][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:44,079][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:44,813][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:45,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:45,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:46,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:46,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:46,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:47,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:47,719][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:48,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:48,381][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:48,707][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:49,033][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:49,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:50,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:50,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:50,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:50,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:51,642][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:51,967][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:52,622][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:52,948][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:53,273][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:53,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:53,931][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:54,259][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:54,587][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:54,911][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:55,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:55,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:56,642][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:56,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:56,645][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:57,582][__main__][INFO] - Iteration 568 took 23s (39.95% Gen, 56.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 52m 54s. Estimated total time: 19h 31m 2s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 10s.
[2025-11-13 11:44:57,584][__main__][INFO] - Starting iteration 568.
[2025-11-13 11:44:57,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:57,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:06,938][__main__][INFO] - Number of regex retries in iteration 568: 0
[2025-11-13 11:45:06,939][__main__][INFO] - agents played in iteration 568 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:45:07,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:07,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:07,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:07,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:07,495][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:07,495][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:08,207][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:08,833][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:09,164][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:09,490][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:09,820][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:10,150][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:10,482][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:10,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:11,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:11,802][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:12,128][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:12,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:14,090][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:14,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:14,756][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:15,081][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:15,732][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:16,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:17,037][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:17,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:18,013][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:18,340][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:18,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:19,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:20,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:20,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:20,074][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:20,949][__main__][INFO] - Iteration 569 took 23s (40.02% Gen, 56.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 49m 39s. Estimated total time: 19h 28m 10s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 41s.
[2025-11-13 11:45:20,951][__main__][INFO] - Starting iteration 569.
[2025-11-13 11:45:20,955][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:20,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:30,441][__main__][INFO] - Number of regex retries in iteration 569: 0
[2025-11-13 11:45:30,441][__main__][INFO] - agents played in iteration 569 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:45:30,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:30,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:30,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:31,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:31,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:31,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:31,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:32,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:32,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:32,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:33,336][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:33,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:34,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:35,308][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:35,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:35,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:36,295][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:36,628][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:36,953][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:37,285][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:37,612][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:37,936][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:38,264][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:38,913][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:39,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:39,571][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:39,899][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:40,224][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:40,879][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:41,529][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:42,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:42,845][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:43,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:43,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:43,578][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:44,527][__main__][INFO] - Iteration 570 took 23s (40.24% Gen, 55.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 59m 42s. Estimated total time: 19h 38m 37s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 26s.
[2025-11-13 11:45:44,529][__main__][INFO] - Starting iteration 570.
[2025-11-13 11:45:44,533][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:44,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:53,983][__main__][INFO] - Number of regex retries in iteration 570: 0
[2025-11-13 11:45:53,984][__main__][INFO] - agents played in iteration 570 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:45:54,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:54,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:54,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:54,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:54,551][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:54,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:55,286][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:55,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:55,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:56,236][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:56,565][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:56,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:57,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:57,541][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:57,874][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:58,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:58,523][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:58,855][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:59,181][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:59,508][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:59,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:00,160][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:00,485][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:01,135][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:01,462][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:02,119][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:03,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:03,744][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:04,395][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:04,720][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:05,047][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:05,372][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:05,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:06,382][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:07,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:07,100][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:07,101][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:08,855][__main__][INFO] - Iteration 571 took 24s (38.85% Gen, 53.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 36m 50s. Estimated total time: 20h 16m 9s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 32s, 500 more iterations: 3h 22m 41s.
[2025-11-13 11:46:08,857][__main__][INFO] - Starting iteration 571.
[2025-11-13 11:46:08,860][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:46:08,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:46:18,363][__main__][INFO] - Number of regex retries in iteration 571: 0 [2025-11-13 11:46:18,363][__main__][INFO] - agents played in iteration 571 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:46:18,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:18,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:18,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:18,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:46:18,915][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:46:18,915][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:46:19,642][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:20,266][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:20,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:20,918][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:21,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:21,579][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:21,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:22,234][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:22,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:22,896][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:23,238][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:23,566][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:24,224][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:24,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:24,892][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:25,871][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:26,198][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:26,523][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:26,848][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:27,175][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:27,826][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:28,475][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:28,801][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:29,126][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:29,452][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:29,777][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:30,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:30,768][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:31,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:31,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:31,486][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:32,375][__main__][INFO] - Iteration 572 took 23s (40.41% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 56m 4s. Estimated total time: 19h 35m 46s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 57s.
[2025-11-13 11:46:32,377][__main__][INFO] - Starting iteration 572.
[2025-11-13 11:46:32,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:46:32,382][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:46:41,919][__main__][INFO] - Number of regex retries in iteration 572: 0
[2025-11-13 11:46:41,919][__main__][INFO] - agents played in iteration 572 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:46:42,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:42,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:42,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:42,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:42,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:46:42,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:43,185][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:43,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:43,814][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:44,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:44,483][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:44,808][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:45,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:45,463][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:46,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:46,452][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:46,784][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:47,110][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:47,439][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:48,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:48,421][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:49,396][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:49,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:50,047][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:50,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:51,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:51,359][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:51,685][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:52,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:52,662][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:52,987][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:53,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:53,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:54,302][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:55,043][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:55,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:55,046][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:55,962][__main__][INFO] - Iteration 573 took 23s (40.44% Gen, 55.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 59m 0s. Estimated total time: 19h 39m 6s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 31s.
[2025-11-13 11:46:55,964][__main__][INFO] - Starting iteration 573.
[2025-11-13 11:46:55,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:46:55,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:04,956][__main__][INFO] - Number of regex retries in iteration 573: 0
[2025-11-13 11:47:04,956][__main__][INFO] - agents played in iteration 573 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:47:05,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:05,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:05,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:05,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:05,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:05,511][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:06,242][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:06,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:07,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:07,519][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:07,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:08,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:08,823][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:09,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:09,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:09,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:10,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:10,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:11,434][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:11,764][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:12,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:12,421][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:12,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:13,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:14,055][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:14,381][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:47:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:47:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:47:15,359][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:47:15,685][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:47:16,009][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:47:16,334][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:47:16,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:47:17,330][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:18,041][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:18,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:18,044][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:18,902][__main__][INFO] - Iteration 574 took 22s (39.19% Gen, 57.06% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 26m 19s. Estimated total time: 19h 6m 48s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 8s.
[2025-11-13 11:47:18,904][__main__][INFO] - Starting iteration 574.
[2025-11-13 11:47:18,907][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:18,908][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:27,540][__main__][INFO] - Number of regex retries in iteration 574: 0
[2025-11-13 11:47:27,540][__main__][INFO] - agents played in iteration 574 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:47:27,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:28,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:28,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:28,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:28,095][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:28,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:28,821][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:29,117][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:29,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:30,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:30,425][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:30,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:31,078][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:31,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:31,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:32,398][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:32,728][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:33,053][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:33,380][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:34,032][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:34,358][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:34,685][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:35,340][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:35,674][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:36,003][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:36,332][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:36,663][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:36,991][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:47:37,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:47:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:47:37,985][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:47:38,310][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:47:38,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:47:38,960][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:47:39,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:47:39,955][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:40,681][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:40,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:40,684][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:41,543][__main__][INFO] - Iteration 575 took 22s (38.13% Gen, 58.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 10m 58s. Estimated total time: 18h 51m 49s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 38s.
[2025-11-13 11:47:41,545][__main__][INFO] - Starting iteration 575.
[2025-11-13 11:47:41,549][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:41,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:50,500][__main__][INFO] - Number of regex retries in iteration 575: 0
[2025-11-13 11:47:50,500][__main__][INFO] - agents played in iteration 575 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:47:50,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:50,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:51,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:51,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:51,067][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:51,068][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:51,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:52,088][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:52,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:52,743][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:53,069][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:53,396][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:54,050][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:54,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:55,357][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:55,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:56,336][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:56,662][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:56,987][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:57,313][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:57,971][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:58,625][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:58,951][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:59,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:59,932][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:00,914][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:01,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:01,902][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:02,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:02,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:03,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:03,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:03,629][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:04,509][__main__][INFO] - Iteration 576 took 22s (38.98% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 26m 50s. Estimated total time: 19h 8m 5s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 20s.
[2025-11-13 11:48:04,512][__main__][INFO] - Starting iteration 576.
[2025-11-13 11:48:04,515][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:04,515][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:13,345][__main__][INFO] - Number of regex retries in iteration 576: 0
[2025-11-13 11:48:13,346][__main__][INFO] - agents played in iteration 576 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:48:13,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:13,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:13,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:13,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:13,896][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:13,896][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:15,270][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:15,597][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:15,924][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:16,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:16,577][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:17,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:17,571][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:17,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:18,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:19,214][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:19,869][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:20,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:20,854][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:21,182][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:21,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:22,168][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:22,823][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:23,485][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:23,812][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:24,138][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:24,462][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:24,788][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:25,112][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:25,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:26,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:26,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:26,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:26,839][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:27,741][__main__][INFO] - Iteration 577 took 23s (38.02% Gen, 58.09% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 39m 43s. Estimated total time: 19h 21m 21s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 33s.
[2025-11-13 11:48:27,743][__main__][INFO] - Starting iteration 577.
[2025-11-13 11:48:27,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:27,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:35,632][__main__][INFO] - Number of regex retries in iteration 577: 0
[2025-11-13 11:48:35,633][__main__][INFO] - agents played in iteration 577 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:48:36,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:36,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:36,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:36,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:36,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:36,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:37,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:37,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:38,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:38,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:38,891][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:39,218][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:40,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:40,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:40,853][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:41,840][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:42,168][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:42,497][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:42,822][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:43,474][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:43,801][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:44,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:44,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:45,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:45,451][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:45,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:46,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:46,432][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:46,758][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:47,086][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:47,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:48,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:48,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:48,806][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:48,807][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:49,670][__main__][INFO] - Iteration 578 took 21s (35.96% Gen, 60.09% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 34m 15s. Estimated total time: 18h 16m 15s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 32s, 500 more iterations: 3h 2m 42s.
[2025-11-13 11:48:49,672][__main__][INFO] - Starting iteration 578.
[2025-11-13 11:48:49,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:49,676][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:57,784][__main__][INFO] - Number of regex retries in iteration 578: 0
[2025-11-13 11:48:57,784][__main__][INFO] - agents played in iteration 578 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:48:58,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:58,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:58,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:58,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:58,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:58,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:59,100][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:59,406][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:59,727][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:00,380][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:00,706][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:01,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:02,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:02,341][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:02,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:02,994][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:03,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:04,630][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:04,961][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:05,287][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:05,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:06,282][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:06,608][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:06,933][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:08,575][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:09,231][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:09,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:10,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:49:10,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:49:11,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:49:11,002][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:49:11,881][__main__][INFO] - Iteration 579 took 22s (36.51% Gen, 59.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 47m 55s. Estimated total time: 18h 30m 17s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 0s, 500 more iterations: 3h 5m 2s.
[2025-11-13 11:49:11,883][__main__][INFO] - Starting iteration 579.
[2025-11-13 11:49:11,887][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:49:11,888][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:20,333][__main__][INFO] - Number of regex retries in iteration 579: 0
[2025-11-13 11:49:20,334][__main__][INFO] - agents played in iteration 579 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:49:20,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:20,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:20,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:20,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:20,911][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:20,911][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:21,649][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:21,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:22,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:23,264][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:23,922][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:25,230][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:25,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:26,208][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:26,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:27,193][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:27,516][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:27,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:28,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:28,499][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:28,828][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:29,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:29,484][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:29,814][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:30,138][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:30,466][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:31,120][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:31,445][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:31,772][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:32,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:32,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:49:33,528][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:49:33,529][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:49:33,531][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:49:34,416][__main__][INFO] - Iteration 580 took 22s (37.49% Gen, 58.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 45s. Estimated total time: 18h 46m 30s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 33s, 500 more iterations: 3h 7m 45s.
[2025-11-13 11:49:34,418][__main__][INFO] - Starting iteration 580.
[2025-11-13 11:49:34,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:49:34,422][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:42,930][__main__][INFO] - Number of regex retries in iteration 580: 0
[2025-11-13 11:49:42,930][__main__][INFO] - agents played in iteration 580 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:49:43,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:43,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:43,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:43,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:43,488][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:43,488][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:44,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:45,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:46,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:46,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:47,157][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:47,483][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:47,810][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:48,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:48,463][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:49,117][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:49,444][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:49,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:50,097][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:50,751][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:51,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:51,409][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:51,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:52,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:52,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:52,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:53,369][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:54,674][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:55,388][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:49:56,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:49:56,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:49:56,140][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:49:57,906][__main__][INFO] - Iteration 581 took 23s (36.23% Gen, 56.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 51m 8s. Estimated total time: 19h 34m 17s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 42s.
[2025-11-13 11:49:57,908][__main__][INFO] - Starting iteration 581.
[2025-11-13 11:49:57,912][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:49:57,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:50:07,158][__main__][INFO] - Number of regex retries in iteration 581: 0
[2025-11-13 11:50:07,159][__main__][INFO] - agents played in iteration 581 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:50:07,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:07,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:07,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:07,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:07,721][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:50:07,722][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:50:08,456][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:08,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:09,079][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:50:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:50:09,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:50:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:50:10,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:50:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:50:11,045][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:50:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:12,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:12,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:13,007][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:13,661][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:13,987][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:50:14,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:50:14,966][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:50:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:50:15,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:50:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:50:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:50:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:50:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:50:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:50:17,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:17,918][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:18,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:18,576][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:18,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:19,619][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:20,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:20,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:20,357][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:21,287][__main__][INFO] - Iteration 582 took 23s (39.55% Gen, 56.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 17s. Estimated total time: 19h 28m 48s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 48s.
[2025-11-13 11:50:21,289][__main__][INFO] - Starting iteration 582.
[2025-11-13 11:50:21,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:50:21,293][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:50:30,024][__main__][INFO] - Number of regex retries in iteration 582: 0 [2025-11-13 11:50:30,025][__main__][INFO] - agents played in iteration 582 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:50:30,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:30,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:30,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:30,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:50:30,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:50:30,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:50:31,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:31,636][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:50:32,297][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:50:32,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:50:32,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:50:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:50:33,609][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:50:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:50:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:34,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:34,924][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:35,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:35,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:35,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:36,241][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:36,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:37,222][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:50:37,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:50:37,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:50:38,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:50:38,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:50:38,856][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:50:39,182][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:50:39,513][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:50:39,841][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:50:40,167][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:50:40,493][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:40,825][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:41,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:41,473][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:41,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:42,516][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:43,252][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:43,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:43,255][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:44,159][__main__][INFO] - Iteration 583 took 22s (38.18% Gen, 57.85% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 27s. Estimated total time: 19h 3m 21s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 33s.
[2025-11-13 11:50:44,161][__main__][INFO] - Starting iteration 583.
[2025-11-13 11:50:44,164][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:50:44,165][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:50:52,689][__main__][INFO] - Number of regex retries in iteration 583: 0
[2025-11-13 11:50:52,690][__main__][INFO] - agents played in iteration 583 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:50:53,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:53,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:53,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:53,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:53,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:50:53,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:50:53,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:50:54,932][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:50:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:50:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:50:55,925][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:50:56,251][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:50:56,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:50:56,914][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:57,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:57,560][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:57,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:58,543][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:58,870][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:59,197][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:59,529][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:59,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:00,179][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:00,507][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:00,833][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:01,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:01,484][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:02,464][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:02,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:03,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:03,775][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:04,100][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:04,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:05,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:05,876][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:05,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:05,879][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:06,775][__main__][INFO] - Iteration 584 took 22s (37.70% Gen, 58.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 6m 18s. Estimated total time: 18h 50m 35s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 41s, 500 more iterations: 3h 8m 25s.
[2025-11-13 11:51:06,777][__main__][INFO] - Starting iteration 584.
[2025-11-13 11:51:06,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:06,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:14,808][__main__][INFO] - Number of regex retries in iteration 584: 0
[2025-11-13 11:51:14,809][__main__][INFO] - agents played in iteration 584 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:51:15,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:15,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:15,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:15,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:15,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:15,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:16,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:16,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:16,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:17,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:17,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:17,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:18,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:18,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:18,688][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:19,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:19,341][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:19,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:20,335][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:21,316][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:21,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:22,297][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:22,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:22,950][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:23,276][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:23,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:23,929][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:24,254][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:24,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:24,918][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:25,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:25,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:25,895][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:26,220][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:26,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:27,250][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:27,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:27,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:27,971][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:28,864][__main__][INFO] - Iteration 585 took 22s (36.35% Gen, 59.60% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 39m 34s. Estimated total time: 18h 24m 13s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 48s, 500 more iterations: 3h 4m 2s.
[2025-11-13 11:51:28,867][__main__][INFO] - Starting iteration 585.
[2025-11-13 11:51:28,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:28,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:37,793][__main__][INFO] - Number of regex retries in iteration 585: 0
[2025-11-13 11:51:37,794][__main__][INFO] - agents played in iteration 585 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:51:38,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:38,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:38,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:38,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:38,348][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:38,349][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:39,082][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:39,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:39,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:40,031][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:40,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:40,687][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:41,013][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:41,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:41,996][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:42,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:42,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:43,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:43,982][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:44,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:44,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:45,947][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:46,274][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:47,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:47,903][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:48,887][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:49,213][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:49,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:50,246][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:50,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:50,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:51,000][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:51,874][__main__][INFO] - Iteration 586 took 23s (38.79% Gen, 57.40% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 25m 14s. Estimated total time: 19h 10m 16s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 42s.
[2025-11-13 11:51:51,876][__main__][INFO] - Starting iteration 586.
[2025-11-13 11:51:51,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:51,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:00,767][__main__][INFO] - Number of regex retries in iteration 586: 0
[2025-11-13 11:52:00,768][__main__][INFO] - agents played in iteration 586 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:52:01,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:01,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:01,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:01,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:01,348][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:01,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:02,060][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:02,683][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:03,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:03,343][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:03,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:03,996][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:04,327][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:04,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:04,993][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:05,645][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:06,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:06,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:07,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:07,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:07,934][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:08,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:08,918][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:10,553][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:10,879][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:11,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:11,533][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:11,862][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:12,189][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:12,521][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:13,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:13,950][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:13,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:13,952][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:14,850][__main__][INFO] - Iteration 587 took 22s (38.69% Gen, 57.40% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 8s. Estimated total time: 19h 8m 33s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 25s.
[2025-11-13 11:52:14,852][__main__][INFO] - Starting iteration 587.
[2025-11-13 11:52:14,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:14,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:23,567][__main__][INFO] - Number of regex retries in iteration 587: 0
[2025-11-13 11:52:23,568][__main__][INFO] - agents played in iteration 587 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:52:24,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:24,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:24,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:24,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:24,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:24,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:24,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:25,149][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:25,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:25,803][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:26,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:26,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:26,795][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:27,780][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:28,113][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:28,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:28,769][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:29,420][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:30,073][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:30,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:31,051][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:32,033][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:32,359][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:32,686][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:33,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:33,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:33,666][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:34,320][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:34,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:34,971][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:35,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:36,023][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 
[2025-11-13 11:52:36,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt 
[2025-11-13 11:52:36,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt 
[2025-11-13 11:52:36,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl 
[2025-11-13 11:52:37,636][__main__][INFO] - Iteration 588 took 22s (38.24% Gen, 57.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 13m 17s. Estimated total time: 18h 59m 5s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 50s. 
[2025-11-13 11:52:37,638][__main__][INFO] - Starting iteration 588. 
[2025-11-13 11:52:37,641][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:52:37,642][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 
[2025-11-13 11:52:46,546][__main__][INFO] - Number of regex retries in iteration 588: 0 
[2025-11-13 11:52:46,546][__main__][INFO] - agents played in iteration 588 are Alice, Bob_buffer, Bob, Alice_buffer 
[2025-11-13 11:52:46,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:52:47,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:52:47,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:52:47,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:52:47,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-13 11:52:47,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:52:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 
[2025-11-13 11:52:48,123][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 
[2025-11-13 11:52:48,452][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 
[2025-11-13 11:52:48,778][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 
[2025-11-13 11:52:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 
[2025-11-13 11:52:49,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 
[2025-11-13 11:52:49,757][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 
[2025-11-13 11:52:50,084][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 
[2025-11-13 11:52:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 
[2025-11-13 11:52:50,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 
[2025-11-13 11:52:51,067][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 
[2025-11-13 11:52:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 
[2025-11-13 11:52:51,720][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 
[2025-11-13 11:52:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 
[2025-11-13 11:52:52,373][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 
[2025-11-13 11:52:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 
[2025-11-13 11:52:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 
[2025-11-13 11:52:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 
[2025-11-13 11:52:53,681][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 
[2025-11-13 11:52:54,007][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:52:54,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 11:52:54,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 
[2025-11-13 11:52:54,987][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 
[2025-11-13 11:52:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 
[2025-11-13 11:52:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 
[2025-11-13 11:52:55,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 
[2025-11-13 11:52:56,295][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 
[2025-11-13 11:52:56,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 
[2025-11-13 11:52:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 
[2025-11-13 11:52:57,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 
[2025-11-13 11:52:57,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 
[2025-11-13 11:52:57,932][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 
[2025-11-13 11:52:58,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:52:58,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 
[2025-11-13 11:52:59,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt 
[2025-11-13 11:52:59,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt 
[2025-11-13 11:52:59,742][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl 
[2025-11-13 11:53:00,643][__main__][INFO] - Iteration 589 took 23s (38.71% Gen, 57.36% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 59s. Estimated total time: 19h 10m 10s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 41s. 
[2025-11-13 11:53:00,645][__main__][INFO] - Starting iteration 589. 
[2025-11-13 11:53:00,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:53:00,649][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 
[2025-11-13 11:53:09,028][__main__][INFO] - Number of regex retries in iteration 589: 0 
[2025-11-13 11:53:09,029][__main__][INFO] - agents played in iteration 589 are Alice, Bob_buffer, Bob, Alice_buffer 
[2025-11-13 11:53:09,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:09,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:09,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:09,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:09,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-13 11:53:09,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:53:10,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 
[2025-11-13 11:53:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 
[2025-11-13 11:53:10,948][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 
[2025-11-13 11:53:11,272][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 
[2025-11-13 11:53:11,600][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 
[2025-11-13 11:53:11,928][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 
[2025-11-13 11:53:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 
[2025-11-13 11:53:12,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 
[2025-11-13 11:53:12,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 
[2025-11-13 11:53:13,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 
[2025-11-13 11:53:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 
[2025-11-13 11:53:13,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 
[2025-11-13 11:53:14,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 
[2025-11-13 11:53:14,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 
[2025-11-13 11:53:14,885][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 
[2025-11-13 11:53:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 
[2025-11-13 11:53:15,551][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 
[2025-11-13 11:53:15,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 
[2025-11-13 11:53:16,206][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 
[2025-11-13 11:53:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:53:16,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 11:53:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 
[2025-11-13 11:53:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 
[2025-11-13 11:53:17,845][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 
[2025-11-13 11:53:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 
[2025-11-13 11:53:18,497][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 
[2025-11-13 11:53:18,826][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 
[2025-11-13 11:53:19,150][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 
[2025-11-13 11:53:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 
[2025-11-13 11:53:19,803][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 
[2025-11-13 11:53:20,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 
[2025-11-13 11:53:20,455][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 
[2025-11-13 11:53:20,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:53:21,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 
[2025-11-13 11:53:22,203][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt 
[2025-11-13 11:53:22,205][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt 
[2025-11-13 11:53:22,206][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl 
[2025-11-13 11:53:23,112][__main__][INFO] - Iteration 590 took 22s (37.30% Gen, 58.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 56m 38s. Estimated total time: 18h 43m 12s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 26s, 500 more iterations: 3h 7m 12s. 
[2025-11-13 11:53:23,114][__main__][INFO] - Starting iteration 590. 
[2025-11-13 11:53:23,118][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:53:23,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 
[2025-11-13 11:53:31,648][__main__][INFO] - Number of regex retries in iteration 590: 0 
[2025-11-13 11:53:31,648][__main__][INFO] - agents played in iteration 590 are Alice, Bob_buffer, Bob, Alice_buffer 
[2025-11-13 11:53:32,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:32,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:32,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:32,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:32,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-13 11:53:32,212][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:53:32,930][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 
[2025-11-13 11:53:33,227][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 
[2025-11-13 11:53:33,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 
[2025-11-13 11:53:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 
[2025-11-13 11:53:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 
[2025-11-13 11:53:34,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 
[2025-11-13 11:53:34,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 
[2025-11-13 11:53:35,192][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 
[2025-11-13 11:53:35,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 
[2025-11-13 11:53:35,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 
[2025-11-13 11:53:36,173][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 
[2025-11-13 11:53:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 
[2025-11-13 11:53:36,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 
[2025-11-13 11:53:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 
[2025-11-13 11:53:37,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 
[2025-11-13 11:53:37,816][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 
[2025-11-13 11:53:38,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 
[2025-11-13 11:53:38,470][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 
[2025-11-13 11:53:38,803][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 
[2025-11-13 11:53:39,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:53:39,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 11:53:39,783][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 
[2025-11-13 11:53:40,115][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 
[2025-11-13 11:53:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 
[2025-11-13 11:53:40,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 
[2025-11-13 11:53:41,093][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 
[2025-11-13 11:53:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 
[2025-11-13 11:53:41,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 
[2025-11-13 11:53:42,082][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 
[2025-11-13 11:53:42,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 
[2025-11-13 11:53:42,733][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 
[2025-11-13 11:53:43,058][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 
[2025-11-13 11:53:43,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:53:44,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 
[2025-11-13 11:53:44,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt 
[2025-11-13 11:53:44,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt 
[2025-11-13 11:53:44,831][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl 
[2025-11-13 11:53:46,583][__main__][INFO] - Iteration 591 took 23s (36.35% Gen, 56.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 46m 22s. Estimated total time: 19h 33m 19s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s. 
[2025-11-13 11:53:46,586][__main__][INFO] - Starting iteration 591. 
[2025-11-13 11:53:46,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:53:46,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 
[2025-11-13 11:53:55,874][__main__][INFO] - Number of regex retries in iteration 591: 0 
[2025-11-13 11:53:55,875][__main__][INFO] - agents played in iteration 591 are Alice, Bob_buffer, Bob, Alice_buffer 
[2025-11-13 11:53:56,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:56,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:56,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:56,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:53:56,427][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-13 11:53:56,428][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:53:57,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 
[2025-11-13 11:53:57,481][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 
[2025-11-13 11:53:57,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 
[2025-11-13 11:53:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 
[2025-11-13 11:53:58,464][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 
[2025-11-13 11:53:58,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 
[2025-11-13 11:53:59,118][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 
[2025-11-13 11:53:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 
[2025-11-13 11:53:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 
[2025-11-13 11:54:00,095][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 
[2025-11-13 11:54:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 
[2025-11-13 11:54:00,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 
[2025-11-13 11:54:01,074][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 
[2025-11-13 11:54:01,401][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 
[2025-11-13 11:54:01,730][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 
[2025-11-13 11:54:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 
[2025-11-13 11:54:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 
[2025-11-13 11:54:02,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 
[2025-11-13 11:54:03,039][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 
[2025-11-13 11:54:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:54:03,695][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 11:54:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 
[2025-11-13 11:54:04,349][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 
[2025-11-13 11:54:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 
[2025-11-13 11:54:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 
[2025-11-13 11:54:05,327][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 
[2025-11-13 11:54:05,653][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 
[2025-11-13 11:54:05,979][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 
[2025-11-13 11:54:06,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 
[2025-11-13 11:54:06,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 
[2025-11-13 11:54:06,957][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 
[2025-11-13 11:54:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 
[2025-11-13 11:54:07,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:54:08,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 
[2025-11-13 11:54:09,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt 
[2025-11-13 11:54:09,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt 
[2025-11-13 11:54:09,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl 
[2025-11-13 11:54:09,951][__main__][INFO] - Iteration 592 took 23s (39.74% Gen, 56.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 40m 48s. Estimated total time: 19h 28m 8s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 41s. 
[2025-11-13 11:54:09,953][__main__][INFO] - Starting iteration 592. 
[2025-11-13 11:54:09,956][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:54:09,957][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 
[2025-11-13 11:54:18,593][__main__][INFO] - Number of regex retries in iteration 592: 0 
[2025-11-13 11:54:18,593][__main__][INFO] - agents played in iteration 592 are Alice, Bob_buffer, Bob, Alice_buffer 
[2025-11-13 11:54:19,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:54:19,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:54:19,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:54:19,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 
[2025-11-13 11:54:19,148][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-13 11:54:19,149][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:54:19,886][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 
[2025-11-13 11:54:20,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 
[2025-11-13 11:54:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 
[2025-11-13 11:54:20,842][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 
[2025-11-13 11:54:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 
[2025-11-13 11:54:21,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 
[2025-11-13 11:54:21,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 
[2025-11-13 11:54:22,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 
[2025-11-13 11:54:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 
[2025-11-13 11:54:22,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 
[2025-11-13 11:54:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 
[2025-11-13 11:54:23,462][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 
[2025-11-13 11:54:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 
[2025-11-13 11:54:24,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 
[2025-11-13 11:54:24,441][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 
[2025-11-13 11:54:24,769][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 
[2025-11-13 11:54:25,098][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 
[2025-11-13 11:54:25,425][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 
[2025-11-13 11:54:25,751][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 
[2025-11-13 11:54:26,079][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:54:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 11:54:26,737][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 
[2025-11-13 11:54:27,066][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 
[2025-11-13 11:54:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 
[2025-11-13 11:54:27,721][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 
[2025-11-13 11:54:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 
[2025-11-13 11:54:28,373][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 
[2025-11-13 11:54:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 
[2025-11-13 11:54:29,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 
[2025-11-13 11:54:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 
[2025-11-13 11:54:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 
[2025-11-13 11:54:30,005][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 
[2025-11-13 11:54:30,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:54:31,045][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 
[2025-11-13 11:54:31,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt 
[2025-11-13 11:54:31,766][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt 
[2025-11-13 11:54:31,768][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl 
[2025-11-13 11:54:32,593][__main__][INFO] - Iteration 593 took 22s (38.15% Gen, 58.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 4m 9s. Estimated total time: 18h 51m 52s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 38s. 
[2025-11-13 11:54:32,595][__main__][INFO] - Starting iteration 593. 
[2025-11-13 11:54:32,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:54:32,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:41,502][__main__][INFO] - Number of regex retries in iteration 593: 0 [2025-11-13 11:54:41,503][__main__][INFO] - agents played in iteration 593 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 11:54:41,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:42,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:42,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:42,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:42,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:42,082][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:54:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:54:43,091][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:54:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:54:43,747][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:54:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:54:44,401][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:54:44,733][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:54:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:54:45,387][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:54:45,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:54:46,053][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:54:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:54:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:54:47,035][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:54:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:54:47,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:54:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:54:48,349][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:54:48,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:54:49,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:54:49,332][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:54:49,663][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:54:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:54:50,312][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:54:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:54:50,979][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:54:51,299][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:54:51,626][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:54:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:54:52,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:54:52,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:54:52,928][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:54:53,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:54:53,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:54:54,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:54:54,665][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:54:54,666][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:54:55,538][__main__][INFO] - Iteration 594 took 22s (38.81% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 18m 56s. Estimated total time: 19h 7m 1s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 10s.
[2025-11-13 11:54:55,540][__main__][INFO] - Starting iteration 594.
[2025-11-13 11:54:55,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:54:55,543][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:04,464][__main__][INFO] - Number of regex retries in iteration 594: 0
[2025-11-13 11:55:04,465][__main__][INFO] - agents played in iteration 594 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:55:04,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:04,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:04,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:05,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:05,020][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:05,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:55:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:06,052][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:06,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:07,033][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:07,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:08,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:08,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:08,672][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:08,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:09,325][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:09,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:09,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:10,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:10,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:11,303][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:11,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:12,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:12,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:13,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:13,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:14,247][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:14,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:14,902][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:15,228][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:15,554][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:15,886][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:16,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:16,926][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:55:17,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:55:17,678][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:55:17,680][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:55:18,497][__main__][INFO] - Iteration 595 took 22s (38.86% Gen, 57.57% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 15s. Estimated total time: 19h 7m 44s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 17s.
[2025-11-13 11:55:18,499][__main__][INFO] - Starting iteration 595.
[2025-11-13 11:55:18,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:55:18,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:27,415][__main__][INFO] - Number of regex retries in iteration 595: 0
[2025-11-13 11:55:27,416][__main__][INFO] - agents played in iteration 595 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:55:27,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:27,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:27,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:27,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:27,974][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:27,975][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:55:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:28,998][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:30,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:30,957][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:31,286][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:31,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:32,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:32,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:33,576][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:33,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:34,229][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:34,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:34,887][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:35,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:35,870][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:36,197][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:37,177][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:38,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:38,812][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:39,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:39,863][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:55:40,563][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:55:40,564][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:55:40,566][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:55:41,403][__main__][INFO] - Iteration 596 took 22s (38.92% Gen, 57.42% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 14s. Estimated total time: 19h 5m 5s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 50s.
[2025-11-13 11:55:41,405][__main__][INFO] - Starting iteration 596.
[2025-11-13 11:55:41,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:55:41,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:49,906][__main__][INFO] - Number of regex retries in iteration 596: 0
[2025-11-13 11:55:49,907][__main__][INFO] - agents played in iteration 596 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:55:50,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:50,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:50,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:50,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:50,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:50,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:55:51,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:51,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:52,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:52,507][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:52,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:53,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:54,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:54,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:54,808][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:55,133][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:55,786][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:56,113][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:56,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:57,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:57,424][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:57,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:58,076][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:58,406][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:59,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:59,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:00,041][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:00,368][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:00,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:01,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:01,348][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:01,682][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:02,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:03,111][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:03,112][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:03,114][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:03,943][__main__][INFO] - Iteration 597 took 22s (37.71% Gen, 58.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 57m 32s. Estimated total time: 18h 46m 46s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 33s, 500 more iterations: 3h 7m 47s.
[2025-11-13 11:56:03,945][__main__][INFO] - Starting iteration 597.
[2025-11-13 11:56:03,948][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:03,948][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:13,075][__main__][INFO] - Number of regex retries in iteration 597: 0
[2025-11-13 11:56:13,076][__main__][INFO] - agents played in iteration 597 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:56:13,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:13,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:13,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:13,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:13,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:13,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:14,356][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:14,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:14,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:15,311][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:15,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:15,964][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:16,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:16,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:16,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:17,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:17,596][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:17,922][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:18,574][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:18,901][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:19,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:20,206][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:20,533][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:20,861][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:21,188][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:22,835][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:23,162][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:23,493][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:23,819][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:24,471][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:24,801][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:25,514][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:26,231][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:26,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:26,235][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:27,113][__main__][INFO] - Iteration 598 took 23s (39.40% Gen, 56.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 40s. Estimated total time: 19h 18m 18s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 3s.
[2025-11-13 11:56:27,115][__main__][INFO] - Starting iteration 598.
[2025-11-13 11:56:27,117][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:27,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:35,543][__main__][INFO] - Number of regex retries in iteration 598: 0
[2025-11-13 11:56:35,544][__main__][INFO] - agents played in iteration 598 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:56:35,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:36,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:36,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:36,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:36,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:36,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:36,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:37,123][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:37,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:37,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:38,427][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:39,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:39,744][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:40,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:41,380][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:41,707][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:42,032][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:42,358][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:42,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:43,337][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:43,665][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:43,991][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:44,317][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:44,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:44,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:45,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:45,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:45,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:46,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:46,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:47,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:47,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:48,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:48,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:48,704][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:49,524][__main__][INFO] - Iteration 599 took 22s (37.60% Gen, 58.73% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 50m 23s. Estimated total time: 18h 40m 23s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 20s, 500 more iterations: 3h 6m 43s.
[2025-11-13 11:56:49,526][__main__][INFO] - Starting iteration 599.
[2025-11-13 11:56:49,530][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:49,531][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:58,464][__main__][INFO] - Number of regex retries in iteration 599: 0
[2025-11-13 11:56:58,465][__main__][INFO] - agents played in iteration 599 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:56:58,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:58,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:58,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:59,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:59,036][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:59,036][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:59,760][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:00,057][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:01,370][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:01,703][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:02,024][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:02,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:02,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:03,003][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:03,654][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:04,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:04,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:04,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:06,926][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:07,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:07,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:07,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:08,237][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:08,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:08,891][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:09,222][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:09,556][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:10,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:10,952][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:11,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:11,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:11,647][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:12,484][__main__][INFO] - Iteration 600 took 22s (38.92% Gen, 57.43% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 21s. Estimated total time: 19h 7m 44s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 17s.
[2025-11-13 11:57:12,486][__main__][INFO] - Starting iteration 600.
[2025-11-13 11:57:12,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:57:12,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:21,479][__main__][INFO] - Number of regex retries in iteration 600: 0
[2025-11-13 11:57:21,480][__main__][INFO] - agents played in iteration 600 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:57:21,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:21,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:21,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:22,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:22,037][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:22,037][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:23,059][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:23,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:24,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:24,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:25,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:25,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:26,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:26,660][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:26,994][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:27,321][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:27,646][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:27,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:28,303][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:28,629][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:28,956][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:29,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:29,610][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:29,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:30,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:30,589][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:30,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:31,244][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:31,570][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:31,896][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:32,224][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:32,551][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:32,880][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:33,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:33,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:34,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:34,615][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:34,617][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:36,315][__main__][INFO] - Iteration 601 took 23s (37.73% Gen, 55.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 0m 35s. Estimated total time: 19h 51m 22s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 33s.
[2025-11-13 11:57:36,317][__main__][INFO] - Starting iteration 601.
[2025-11-13 11:57:36,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:57:36,321][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:45,138][__main__][INFO] - Number of regex retries in iteration 601: 0
[2025-11-13 11:57:45,139][__main__][INFO] - agents played in iteration 601 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:57:45,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:45,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:45,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:45,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:45,703][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:45,704][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:46,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:46,876][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:47,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:48,181][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:48,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:49,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:50,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:51,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:51,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:51,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:52,445][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:53,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:53,755][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:54,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:54,408][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:55,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:55,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:55,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:56,044][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:56,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:57,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:57,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:58,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:58,496][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:58,498][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:59,368][__main__][INFO] - Iteration 602 took 23s (38.26% Gen, 57.96% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 21m 15s. Estimated total time: 19h 12m 25s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 4s.
[2025-11-13 11:57:59,370][__main__][INFO] - Starting iteration 602.
[2025-11-13 11:57:59,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:57:59,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:08,576][__main__][INFO] - Number of regex retries in iteration 602: 0
[2025-11-13 11:58:08,577][__main__][INFO] - agents played in iteration 602 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:58:09,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:09,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:09,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:09,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:09,157][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:09,157][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:09,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:10,201][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:10,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:11,512][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:12,164][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:12,819][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:13,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:13,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:13,798][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:14,124][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:58:14,452][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:58:14,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:58:15,110][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:58:15,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:58:15,764][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:58:16,089][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:58:16,416][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:58:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:58:17,068][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:58:17,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:58:17,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:58:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:58:18,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:58:18,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:58:19,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:58:19,366][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:58:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:58:20,025][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:58:20,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:58:21,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:58:21,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:58:21,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:58:21,797][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:58:22,779][__main__][INFO] - Iteration 603 took 23s (39.32% Gen, 56.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 46s. Estimated total time: 19h 30m 19s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 3s.
[2025-11-13 11:58:22,781][__main__][INFO] - Starting iteration 603.
[2025-11-13 11:58:22,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:58:22,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:31,766][__main__][INFO] - Number of regex retries in iteration 603: 0
[2025-11-13 11:58:31,767][__main__][INFO] - agents played in iteration 603 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:58:32,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:32,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:32,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:32,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:32,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:32,346][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:33,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:33,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:34,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:34,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:35,036][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:35,365][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:35,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:36,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:36,676][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:37,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:37,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:58:37,655][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:58:37,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:58:38,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:58:38,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:58:38,959][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:58:39,285][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:58:39,613][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:58:39,945][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:58:40,264][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:58:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:58:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:58:41,252][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:58:41,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:58:41,902][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:58:42,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:58:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:58:42,885][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:58:43,213][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:58:43,538][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:58:44,251][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:58:44,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:58:44,954][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:58:44,955][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:58:45,805][__main__][INFO] - Iteration 604 took 23s (39.01% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 9s. Estimated total time: 19h 11m 5s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 50s.
[2025-11-13 11:58:45,808][__main__][INFO] - Starting iteration 604.
[2025-11-13 11:58:45,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:58:45,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:54,446][__main__][INFO] - Number of regex retries in iteration 604: 0
[2025-11-13 11:58:54,446][__main__][INFO] - agents played in iteration 604 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:58:54,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:54,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:54,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:54,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:54,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:54,999][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:56,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:56,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:56,689][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:57,020][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:57,345][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:57,671][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:57,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:58,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:58,655][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:58,983][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:59,639][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:59,965][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:00,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:00,619][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:01,274][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:01,603][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:01,931][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:02,590][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:02,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:03,241][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:03,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:04,546][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:04,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:05,197][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:05,523][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:05,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:06,176][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:06,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:07,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:07,611][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:07,613][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:09,280][__main__][INFO] - Iteration 605 took 23s (36.78% Gen, 56.10% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 41m 7s. Estimated total time: 19h 33m 26s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 34s.
[2025-11-13 11:59:09,281][__main__][INFO] - Starting iteration 605.
[2025-11-13 11:59:09,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:09,285][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:17,620][__main__][INFO] - Number of regex retries in iteration 605: 0
[2025-11-13 11:59:17,621][__main__][INFO] - agents played in iteration 605 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:59:18,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:18,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:18,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:19,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:19,579][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:19,905][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:20,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:20,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:21,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:21,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:21,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:22,201][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:22,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:22,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:23,182][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:23,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:23,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:24,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:25,138][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:25,790][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:26,442][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:26,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:27,094][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:27,421][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:27,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:28,072][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:29,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:29,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:30,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:30,842][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:30,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:30,845][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:32,482][__main__][INFO] - Iteration 606 took 23s (35.93% Gen, 57.01% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 27m 10s. Estimated total time: 19h 19m 53s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 18s.
[2025-11-13 11:59:32,484][__main__][INFO] - Starting iteration 606.
[2025-11-13 11:59:32,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:32,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:41,130][__main__][INFO] - Number of regex retries in iteration 606: 0
[2025-11-13 11:59:41,133][__main__][INFO] - agents played in iteration 606 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 11:59:41,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:41,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:42,792][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:43,089][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:43,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:44,396][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:44,721][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:45,051][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:45,375][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:45,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:46,030][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:46,363][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:47,015][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:47,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:47,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:47,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:48,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:48,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:48,974][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:49,302][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:49,629][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:49,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:50,282][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:50,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:50,934][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:51,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:51,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:52,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:52,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:52,889][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:53,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:53,932][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:54,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:54,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:54,658][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:55,565][__main__][INFO] - Iteration 607 took 23s (37.46% Gen, 58.59% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 20m 52s. Estimated total time: 19h 13m 58s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 19s.
[2025-11-13 11:59:55,567][__main__][INFO] - Starting iteration 607.
[2025-11-13 11:59:55,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:55,571][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:04,435][__main__][INFO] - Number of regex retries in iteration 607: 0
[2025-11-13 12:00:04,437][__main__][INFO] - agents played in iteration 607 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:00:04,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:04,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:04,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:05,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:05,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:05,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:05,763][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:06,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:06,388][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:06,715][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:00:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:00:07,703][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:00:08,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:00:08,357][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:00:08,682][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:00:09,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:00:09,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:00:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:00:09,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:00:10,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:00:10,645][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:10,972][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:11,303][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:00:11,956][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:00:12,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:00:12,620][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:00:12,946][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:00:13,274][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:00:13,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:00:13,930][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:00:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:00:14,588][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:00:14,915][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:00:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:00:15,568][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:00:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:00:16,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:16,960][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:17,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:17,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:17,718][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:19,223][__main__][INFO] - Iteration 608 took 23s (37.48% Gen, 56.15% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 49m 9s. Estimated total time: 19h 42m 39s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 6s.
[2025-11-13 12:00:19,225][__main__][INFO] - Starting iteration 608.
[2025-11-13 12:00:19,229][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:19,229][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:28,252][__main__][INFO] - Number of regex retries in iteration 608: 0
[2025-11-13 12:00:28,252][__main__][INFO] - agents played in iteration 608 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:00:28,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:29,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:29,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:29,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:29,159][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:29,159][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:30,224][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:30,550][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:30,878][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:31,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:00:31,537][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:00:31,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:00:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:00:32,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:00:32,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:00:33,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:00:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:00:33,826][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:00:34,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:00:34,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:00:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:35,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:00:36,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:00:36,439][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:00:36,765][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:00:37,091][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:00:37,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:00:37,743][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:00:38,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:00:38,394][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:00:38,729][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:00:39,054][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:00:39,385][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:00:39,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:00:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:00:40,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:41,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:41,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:41,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:41,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:43,282][__main__][INFO] - Iteration 609 took 24s (37.51% Gen, 56.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 8m 50s. Estimated total time: 20h 2m 43s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 27s.
[2025-11-13 12:00:43,285][__main__][INFO] - Starting iteration 609.
[2025-11-13 12:00:43,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:43,288][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:52,679][__main__][INFO] - Number of regex retries in iteration 609: 0
[2025-11-13 12:00:52,679][__main__][INFO] - agents played in iteration 609 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:00:53,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:53,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:53,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:53,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:53,251][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:53,252][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:53,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:54,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:54,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:00:55,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:00:55,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:00:56,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:00:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:00:56,907][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:00:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:00:57,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:00:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:00:58,229][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:00:58,559][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:00:58,884][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:59,210][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:59,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:00,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:00,515][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:01,506][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:01,833][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:02,157][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:02,484][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:02,812][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:03,137][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:03,464][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:03,801][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:04,131][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:04,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:05,170][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:05,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:05,894][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:05,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:06,932][__main__][INFO] - Iteration 610 took 23s (39.71% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 47m 58s. Estimated total time: 19h 42m 16s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s. [2025-11-13 12:01:06,934][__main__][INFO] - Starting iteration 610. [2025-11-13 12:01:06,938][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 12:01:06,938][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:01:16,184][__main__][INFO] - Number of regex retries in iteration 610: 0 [2025-11-13 12:01:16,185][__main__][INFO] - agents played in iteration 610 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 12:01:16,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:16,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:16,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:16,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:16,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:01:16,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:01:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:01:17,807][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:01:18,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:01:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:01:18,786][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:19,765][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:21,079][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:21,734][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:22,068][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:22,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:22,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:23,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:24,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:24,350][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:24,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:25,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:26,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:26,968][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:27,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:27,622][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:27,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:28,668][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:29,399][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:29,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:29,402][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:32,409][__main__][INFO] - Iteration 611 took 25s (36.30% Gen, 51.89% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 18m 54s. Estimated total time: 21h 13m 37s. Time estimates for 10 more iterations: 4m 14s, 100 more iterations: 42m 27s, 500 more iterations: 3h 32m 16s. [2025-11-13 12:01:32,411][__main__][INFO] - Starting iteration 611. [2025-11-13 12:01:32,414][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 12:01:32,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:01:41,356][__main__][INFO] - Number of regex retries in iteration 611: 0 [2025-11-13 12:01:41,357][__main__][INFO] - agents played in iteration 611 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 12:01:41,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:01:41,924][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:01:42,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:01:42,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:01:43,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:01:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:01:43,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:44,621][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:45,286][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:45,612][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:45,939][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:46,592][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:47,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:47,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:48,877][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:49,203][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:49,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:49,852][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:50,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:50,506][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:50,831][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:51,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:51,814][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:52,143][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:53,127][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:53,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:54,578][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:54,580][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:54,582][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:55,543][__main__][INFO] - Iteration 612 took 23s (38.66% Gen, 57.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 21m 24s. Estimated total time: 19h 16m 30s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 45s. [2025-11-13 12:01:55,545][__main__][INFO] - Starting iteration 612. [2025-11-13 12:01:55,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 12:01:55,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:02:04,232][__main__][INFO] - Number of regex retries in iteration 612: 0 [2025-11-13 12:02:04,232][__main__][INFO] - agents played in iteration 612 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 12:02:04,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:04,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:04,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:04,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:04,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:02:04,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:02:05,526][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:05,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:06,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:06,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:06,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:07,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:07,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:07,784][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:08,110][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:09,095][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:09,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:09,750][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:10,076][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:10,403][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:10,730][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:11,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:11,710][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:12,370][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:12,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:13,030][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:13,356][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:13,683][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:14,008][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:14,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:14,663][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:14,990][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:15,316][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:15,644][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:15,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:16,683][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:02:17,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:02:17,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:02:17,402][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:02:18,420][__main__][INFO] - Iteration 613 took 22s (37.96% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 8m 8s. Estimated total time: 19h 3m 37s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 36s. [2025-11-13 12:02:18,422][__main__][INFO] - Starting iteration 613. [2025-11-13 12:02:18,426][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 12:02:18,426][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:02:28,149][__main__][INFO] - Number of regex retries in iteration 613: 0 [2025-11-13 12:02:28,150][__main__][INFO] - agents played in iteration 613 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 12:02:28,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:28,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:28,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:28,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:28,712][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:02:28,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:02:29,458][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:29,756][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:30,083][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:30,408][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:30,735][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:31,060][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:31,385][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:31,712][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:32,037][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:32,364][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:32,692][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:33,018][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:33,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:33,671][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:35,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:35,970][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:36,624][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:36,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:37,282][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:37,602][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:38,906][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:39,232][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:39,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:40,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:02:41,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:02:41,373][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:02:41,374][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:02:42,491][__main__][INFO] - Iteration 614 took 24s (40.40% Gen, 54.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 26s. Estimated total time: 20h 3m 19s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 6s, 500 more iterations: 3h 20m 33s. [2025-11-13 12:02:42,493][__main__][INFO] - Starting iteration 614. [2025-11-13 12:02:42,497][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 12:02:42,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:02:51,409][__main__][INFO] - Number of regex retries in iteration 614: 0 [2025-11-13 12:02:51,410][__main__][INFO] - agents played in iteration 614 are Alice, Bob_buffer, Bob, Alice_buffer [2025-11-13 12:02:51,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:51,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:51,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:51,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:51,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:02:51,998][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:02:52,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:54,034][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:54,360][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:55,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:55,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:55,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:56,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:56,647][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:57,298][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:57,624][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:57,951][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:58,277][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:58,606][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:58,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:59,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:59,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:00,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:00,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:01,220][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:01,548][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:01,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:02,207][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:02,534][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:02,859][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:03,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:03,849][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:03:04,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:03:04,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:03:04,568][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:03:05,963][__main__][INFO] - Iteration 615 took 23s (37.98% Gen, 56.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 37m 4s. Estimated total time: 19h 33m 20s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s. [2025-11-13 12:03:05,964][__main__][INFO] - Starting iteration 615. [2025-11-13 12:03:05,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 12:03:05,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:15,290][__main__][INFO] - Number of regex retries in iteration 615: 0
[2025-11-13 12:03:15,291][__main__][INFO] - agents played in iteration 615 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:03:15,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:15,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:15,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:15,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:15,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:15,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:16,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:17,223][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:18,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:18,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:18,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:19,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:19,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:19,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:20,165][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:20,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:20,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:21,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:21,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:21,803][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:22,128][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:22,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:23,432][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:23,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:24,086][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:24,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:25,062][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:25,395][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:26,053][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:26,380][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:26,707][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:27,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:27,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:28,438][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:28,439][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:28,441][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:29,441][__main__][INFO] - Iteration 616 took 23s (39.71% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 37m 4s. Estimated total time: 19h 33m 44s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 37s.
[2025-11-13 12:03:29,444][__main__][INFO] - Starting iteration 616.
[2025-11-13 12:03:29,447][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:29,448][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:38,105][__main__][INFO] - Number of regex retries in iteration 616: 0
[2025-11-13 12:03:38,106][__main__][INFO] - agents played in iteration 616 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:03:38,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:38,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:38,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:38,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:38,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:38,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:40,056][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:41,361][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:42,014][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:42,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:43,324][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:43,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:44,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:44,627][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:44,953][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:45,279][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:45,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:45,931][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:46,598][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:47,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:47,591][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:48,248][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:48,578][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:48,911][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:49,238][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:49,582][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:49,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:50,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:51,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:51,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:51,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:52,543][__main__][INFO] - Iteration 617 took 23s (37.48% Gen, 57.34% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 48s. Estimated total time: 19h 14m 51s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 28s.
[2025-11-13 12:03:52,545][__main__][INFO] - Starting iteration 617.
[2025-11-13 12:03:52,549][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:52,550][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:01,663][__main__][INFO] - Number of regex retries in iteration 617: 0
[2025-11-13 12:04:01,663][__main__][INFO] - agents played in iteration 617 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:04:02,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:02,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:02,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:02,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:02,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:02,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:02,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:04,230][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:04,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:05,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:05,535][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:06,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:06,520][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:07,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:08,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:08,483][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:08,808][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:09,464][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:09,790][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:04:10,121][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:04:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:04:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:04:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:04:11,448][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:04:11,773][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:04:12,105][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:04:12,430][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:04:12,756][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:04:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:04:13,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:04:14,089][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:04:14,815][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:04:14,817][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:04:14,819][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:04:15,724][__main__][INFO] - Iteration 618 took 23s (39.32% Gen, 56.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 21m 23s. Estimated total time: 19h 18m 49s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 8s.
[2025-11-13 12:04:15,727][__main__][INFO] - Starting iteration 618.
[2025-11-13 12:04:15,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:04:15,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:24,255][__main__][INFO] - Number of regex retries in iteration 618: 0
[2025-11-13 12:04:24,256][__main__][INFO] - agents played in iteration 618 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:04:24,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:24,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:24,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:24,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:24,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:24,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:25,548][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:25,843][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:26,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:26,501][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:26,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:27,153][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:27,810][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:28,136][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:28,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:28,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:29,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:29,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:29,778][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:30,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:30,443][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:31,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:32,091][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:32,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:04:32,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:04:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:04:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:04:33,744][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:04:34,069][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:04:34,403][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:04:34,734][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:04:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:04:35,387][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:04:35,717][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:04:36,042][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:04:36,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:04:37,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:04:37,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:04:37,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:04:39,299][__main__][INFO] - Iteration 619 took 23s (36.17% Gen, 55.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 40m 41s. Estimated total time: 19h 38m 30s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 25s.
[2025-11-13 12:04:39,301][__main__][INFO] - Starting iteration 619.
[2025-11-13 12:04:39,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:04:39,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:47,738][__main__][INFO] - Number of regex retries in iteration 619: 0
[2025-11-13 12:04:47,738][__main__][INFO] - agents played in iteration 619 are Alice, Bob_buffer, Bob, Alice_buffer
[2025-11-13 12:04:48,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:48,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:48,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:48,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:48,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:48,302][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:49,043][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:49,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:49,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:49,996][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:50,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:51,301][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:51,628][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:51,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:52,283][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:52,937][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:53,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:54,249][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:54,908][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:55,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:55,564][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:55,891][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:04:56,219][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:04:56,547][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:04:56,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:04:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:04:57,548][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:04:57,874][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:04:58,208][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:04:58,535][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:04:58,860][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:04:59,193][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:04:59,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:05:00,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:05:00,905][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:05:00,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:05:00,909][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed9999_bs128/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:05:02,198][__main__][INFO] - Iteration 620 took 22s (36.83% Gen, 57.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 6m 30s. Estimated total time: 19h 4m 43s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 47s.
[2025-11-13 12:05:02,200][__main__][INFO] - Starting iteration 620.
[2025-11-13 12:05:02,204][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:05:02,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:05:12,917][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,926][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,930][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,931][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,932][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,943][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,943][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,947][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,947][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,948][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,948][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,949][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,949][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,950][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,951][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,951][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,952][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,952][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,953][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,953][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,954][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,954][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,955][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,955][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,956][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,956][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,957][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,958][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,958][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,959][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,959][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,960][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,960][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,961][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,961][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,962][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,962][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,963][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,963][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,964][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,964][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,965][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,965][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,966][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,966][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,967][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,967][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,968][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,968][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,969][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,969][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,970][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,971][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,971][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,972][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,972][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,973][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,973][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,974][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,974][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,975][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,975][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,976][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,976][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,977][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,977][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,978][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,978][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,979][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,979][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,980][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,980][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,981][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,981][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,982][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,982][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,983][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,983][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,984][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,984][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,985][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:12,985][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,986][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,987][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,988][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,989][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,990][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,991][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,992][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,993][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,994][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,995][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,995][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:12,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,996][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,997][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,998][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:12,999][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,000][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,001][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,002][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,003][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,003][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,004][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,005][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,006][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,007][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,008][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,009][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,010][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,011][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,011][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,012][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,013][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,014][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,015][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,016][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,017][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,017][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,018][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,048][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,049][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,050][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,051][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,051][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,052][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,053][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,054][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,055][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,055][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,056][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,057][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,058][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,059][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,060][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,061][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,062][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,063][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,064][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,065][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,066][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,067][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,068][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,071][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,072][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,073][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,074][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,075][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,075][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,076][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,077][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,078][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,078][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,078][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:13,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,217][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,218][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,219][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,220][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,220][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,221][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,222][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,223][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,223][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,224][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,225][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,226][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,227][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,227][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,228][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,229][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,230][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,230][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,231][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,232][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,233][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,233][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,234][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,235][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,236][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,237][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,237][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,238][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,239][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,240][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,241][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,242][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,243][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,244][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,245][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,246][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,247][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,247][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,248][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,249][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,250][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,250][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,251][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,257][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,260][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,261][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:13,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,337][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,340][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,341][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,341][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,341][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,341][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,341][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,341][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,341][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,342][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,342][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,342][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,342][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,342][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,342][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,342][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,343][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,344][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,344][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,344][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,344][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,344][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,344][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,344][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,345][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,345][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,345][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,345][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,345][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,345][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,345][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,346][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,347][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,347][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,347][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,347][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,347][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,347][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,347][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,348][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,349][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,349][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,349][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,349][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,349][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,349][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,349][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,350][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,351][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,351][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,351][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,351][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,351][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,351][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,351][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,352][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,352][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,352][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,352][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,352][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,352][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,352][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,353][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,354][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,354][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,354][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,354][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,354][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,354][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,354][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,355][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,356][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,357][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,357][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,357][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,357][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,357][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,357][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,357][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,358][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,359][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,359][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,359][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,359][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,359][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,359][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,359][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,360][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,361][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,361][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,361][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,361][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,361][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,361][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,361][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,362][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,363][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,363][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,363][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,363][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,363][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,363][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,363][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,364][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,370][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,374][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,374][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:13,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,588][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,592][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,598][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,602][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,608][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,612][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,615][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,618][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,619][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,619][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,619][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,619][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,619][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,619][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,619][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,620][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,621][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,621][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,621][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,621][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,621][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,621][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,621][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,622][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,623][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,624][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,624][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,624][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,624][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,624][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,624][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,624][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,625][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,625][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,625][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,625][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,625][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,625][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,715][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,715][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,715][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,715][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,716][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,716][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,716][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,716][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,716][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,717][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,717][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,717][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,717][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,717][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,718][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,718][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,718][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,718][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,718][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,719][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,719][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,719][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,719][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,719][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,720][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,720][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,720][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,720][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,720][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,721][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,721][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,721][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,721][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,721][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,721][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,722][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,722][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,722][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,722][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,722][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,723][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,723][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,723][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,723][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,723][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,723][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,724][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,724][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,724][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,724][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,724][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,724][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,725][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,725][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,725][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,725][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,725][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,725][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,726][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,726][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,726][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,726][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,726][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,726][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,727][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,727][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,727][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,727][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,727][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,727][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,728][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,728][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,728][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,728][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,728][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,728][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,728][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,729][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,729][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,729][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,729][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,729][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,729][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,729][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,730][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,730][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,730][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,730][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,730][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,730][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,730][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,731][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,731][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,731][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,731][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,731][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,731][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,731][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,732][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,733][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,733][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,733][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,733][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,733][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,733][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,733][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,734][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,735][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,735][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,735][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,735][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,735][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,735][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,735][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,736][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,736][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,736][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,736][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,736][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,736][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,736][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,737][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,738][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,738][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,738][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,738][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,738][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,738][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,738][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,739][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,740][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,740][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,740][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,740][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,740][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,740][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,740][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,741][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,741][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,741][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,741][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,741][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,741][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,741][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,742][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,743][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,743][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,743][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,743][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,743][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,743][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,743][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,744][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,745][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,745][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,745][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,745][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,745][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,745][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,745][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,746][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,746][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,746][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,746][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,746][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,746][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,746][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,747][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,748][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,748][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,748][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,748][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,748][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,748][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,748][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,749][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,750][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,750][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,750][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,750][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,750][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,750][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,750][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,751][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,751][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,751][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,751][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,751][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,751][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,751][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,752][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,753][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,753][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,753][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,753][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,753][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,753][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,753][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,754][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,755][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,755][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,755][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,755][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,755][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,755][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,755][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,756][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,757][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,757][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,757][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,757][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,757][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,757][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,757][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,758][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,759][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,759][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,759][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,759][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,759][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,759][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:13,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,830][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,831][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,832][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,833][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,833][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,834][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,835][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,836][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,837][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,838][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,839][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,840][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,841][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,842][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,843][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,843][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,844][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,845][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,846][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,847][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,848][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,849][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,850][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,851][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,852][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,853][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,853][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,854][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,855][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,856][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,857][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,858][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,859][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,860][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,860][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,861][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,862][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,863][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,864][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,865][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,866][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,866][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,867][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,868][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,869][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,870][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,871][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,872][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,873][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:13,873][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:13,873][asyncio][WARNING] - socket.send() raised exception.